1. 06 Aug 2022, 1 commit
  2. 01 Aug 2022, 1 commit
  3. 07 Jul 2022, 2 commits
    • xfs: xlog_sync() manually adjusts grant head space · d9f68777
      Dave Chinner authored
      When xlog_sync() rounds off the tail of the iclog that is being
      flushed, it manually subtracts that space from the grant heads. This
      space is actually reserved by the transaction ticket that covers
      the xlog_sync() call from xlog_write(), but we don't plumb the
      ticket down far enough for it to account for the space consumed in
      the current log ticket.
      
      The grant heads are hot, so we really should be accounting this to
      the ticket if we can, rather than adding thousands of extra grant
      head updates every CIL commit.
      
      Interestingly, this actually indicates a potential log space overrun
      can occur when we force the log. By the time that xfs_log_force()
      pushes out an active iclog and consumes the roundoff space, the
      reservation for that roundoff space has been returned to the grant
      heads and is no longer covered by a reservation. In theory, the
      roundoff added by a log force on an already full log could push the
      write head past the tail. In practice, the CIL commit that writes to
      the log and needs the iclog pushed will have reserved space for
      roundoff, so when it releases the ticket there will still be
      physical space for the roundoff to be committed to the log, even
      though it is no longer reserved. This roundoff won't be enough space
      to allow a transaction to be woken if the log is full, so overruns
      should not actually occur in practice.
      
      That said, it indicates that we should not release the CIL context
      log ticket until after we've released the commit iclog. It also
      means that xlog_sync() still needs the direct grant head
      manipulation if we don't provide it with a ticket. Log forces are
      rare when we are in fast paths running 1.5 million transactions/s
      that make the grant heads hot, so let's optimise the hot case and
      pass CIL log tickets down to the xlog_sync() code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      d9f68777
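      A minimal user-space C sketch of the accounting idea in d9f68777 above: charge the
      iclog roundoff to the log ticket when one is supplied, and fall back to direct grant
      head updates only when no ticket is available. The structures and names (log_ticket,
      xlog_model, account_roundoff) are illustrative stand-ins, not the actual XFS code.

      #include <stdatomic.h>
      #include <stdio.h>

      /* Illustrative stand-ins for the real XFS structures. */
      struct log_ticket {
              int curr_res;           /* unused reservation bytes held by this ticket */
      };

      struct xlog_model {
              atomic_long write_head;         /* hot, shared grant heads (bytes) */
              atomic_long reserve_head;
      };

      /*
       * Account the iclog roundoff. If the caller passed the CIL commit ticket,
       * consume the space from that ticket (a private, uncontended counter) and
       * only fall back to the hot shared grant heads when no ticket is available,
       * e.g. for a log force flushing an active iclog.
       */
      static void account_roundoff(struct xlog_model *log, struct log_ticket *ticket,
                                   long roundoff)
      {
              if (ticket) {
                      ticket->curr_res -= roundoff;   /* cheap: private to this commit */
                      return;
              }
              /* rare slow path: two atomic updates on hot shared state */
              atomic_fetch_sub(&log->write_head, roundoff);
              atomic_fetch_sub(&log->reserve_head, roundoff);
      }

      int main(void)
      {
              struct xlog_model log = { 1 << 20, 1 << 20 };
              struct log_ticket tic = { .curr_res = 4096 };

              account_roundoff(&log, &tic, 512);      /* CIL commit path: ticket supplied */
              account_roundoff(&log, NULL, 512);      /* log force path: no ticket */
              printf("ticket res %d, write head %ld\n", tic.curr_res,
                     atomic_load(&log.write_head));
              return 0;
      }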
    • xfs: convert log vector chain to use list heads · 16924853
      Dave Chinner authored
      The next change is going to require sorting log vectors, and that
      requires arbitrary rearrangement of the list, which cannot be done
      easily with a singly linked list. Convert the log vector chain to
      use list heads in preparation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      16924853
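      A compile-able sketch of the data-structure change in 16924853: replacing a singly
      linked lv_next chain with embedded doubly linked list nodes so entries can be
      unlinked and re-inserted anywhere, which is what sorting needs. The tiny list_head
      imitation and the log_vec layout below are simplified assumptions, not the kernel's
      definitions.

      #include <stddef.h>
      #include <stdio.h>

      /* Minimal stand-in for the kernel's struct list_head. */
      struct list_head { struct list_head *next, *prev; };

      static void list_init(struct list_head *h) { h->next = h->prev = h; }

      static void list_add_tail(struct list_head *node, struct list_head *head)
      {
              node->prev = head->prev;
              node->next = head;
              head->prev->next = node;
              head->prev = node;
      }

      /* O(1) removal from anywhere - the operation arbitrary re-ordering needs. */
      static void list_del(struct list_head *entry)
      {
              entry->prev->next = entry->next;
              entry->next->prev = entry->prev;
              entry->next = entry->prev = entry;
      }

      /* Before: struct xfs_log_vec { struct xfs_log_vec *lv_next; ... };
       * After (simplified): a list node embedded in the vector. */
      struct log_vec {
              struct list_head lv_list;
              int              lv_order_id;   /* illustrative sort key */
      };

      int main(void)
      {
              struct list_head chain;
              struct log_vec a = { .lv_order_id = 2 }, b = { .lv_order_id = 1 };

              list_init(&chain);
              list_add_tail(&a.lv_list, &chain);
              list_add_tail(&b.lv_list, &chain);

              /* Re-order so the chain is sorted ascending: move a behind b. */
              list_del(&a.lv_list);
              list_add_tail(&a.lv_list, &chain);

              printf("first vector order id: %d\n",
                     ((struct log_vec *)((char *)chain.next -
                                         offsetof(struct log_vec, lv_list)))->lv_order_id);
              return 0;
      }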
  4. 02 Jul 2022, 2 commits
    • xfs: rework per-iclog header CIL reservation · 31151cc3
      Dave Chinner authored
      For every iclog that a CIL push will use up, we need to ensure we
      have space reserved for the iclog header in each iclog. It is
      extremely difficult to do this accurately with a per-cpu counter
      without expensive summing of the counter in every commit. However,
      we know what the maximum CIL size is going to be because of the
      hard space limit we have, and hence we know exactly how many iclogs
      we are going to need to write out the CIL.
      
      We are constrained by the requirement that small transactions only
      have reservation space for a single iclog header built into them.
      At commit time we don't know how much of the current transaction
      reservation is made up of iclog header reservations as calculated by
      xfs_log_calc_unit_res() when the ticket was reserved. As larger
      reservations have multiple header spaces reserved, we can steal
      more than one iclog header reservation at a time, but we only steal
      the exact number needed for the given log vector size delta.
      
      As a result, we don't know exactly when we are going to steal iclog
      header reservations, nor do we know exactly how many we are going to
      need for a given CIL.
      
      To make things simple, start by calculating the worst case number of
      iclog headers a full CIL push will require. Record this into an
      atomic variable in the CIL. Then add a byte counter to the log
      ticket that records exactly how much iclog header space has been
      reserved in this ticket by xfs_log_calc_unit_res(). This tells us
      exactly how much space we can steal from the ticket at transaction
      commit time.
      
      Now, at transaction commit time, we can check if the CIL has a full
      iclog header reservation and, if not, steal the entire reservation
      the current ticket holds for iclog headers. This minimises the
      number of times we need to do atomic operations in the fast path,
      but still guarantees we get all the reservations we need.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      31151cc3
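      A simplified sketch of the reservation-stealing scheme described in 31151cc3: the
      CIL tracks how much iclog header space a worst-case push still needs, and each
      committing ticket donates whatever header reservation it carries until that target
      is met. The cil_model/ticket_model types and sizes are invented for illustration;
      this models the idea, not the XFS implementation.

      #include <stdatomic.h>
      #include <stdio.h>

      #define ICLOG_HEADER_SIZE 512   /* illustrative size only */

      struct cil_model {
              /* worst-case iclog header space still unreserved for the next push */
              atomic_int space_needed;
      };

      struct ticket_model {
              int iclog_hdr_res;      /* header space built into this ticket (bytes),
                                       * as xfs_log_calc_unit_res()-style code would do */
              int curr_res;           /* remaining reservation held by the ticket */
      };

      /*
       * At commit time: if the CIL still needs iclog header space, steal the
       * entire header reservation this ticket holds. One atomic op in the
       * common case, and nothing at all once the CIL target has been met.
       */
      static void maybe_steal_iclog_hdr_res(struct cil_model *cil,
                                            struct ticket_model *tic)
      {
              if (atomic_load(&cil->space_needed) <= 0)
                      return;                         /* fast path: nothing to do */

              atomic_fetch_sub(&cil->space_needed, tic->iclog_hdr_res);
              tic->curr_res -= tic->iclog_hdr_res;    /* space moves from ticket to CIL */
              tic->iclog_hdr_res = 0;
      }

      int main(void)
      {
              /* say a full CIL push needs at most 4 iclog headers' worth of space */
              struct cil_model cil = { 4 * ICLOG_HEADER_SIZE };
              struct ticket_model small = { ICLOG_HEADER_SIZE, 8192 };
              struct ticket_model large = { 3 * ICLOG_HEADER_SIZE, 65536 };

              maybe_steal_iclog_hdr_res(&cil, &small);
              maybe_steal_iclog_hdr_res(&cil, &large);
              printf("header space still needed: %d\n", atomic_load(&cil.space_needed));
              return 0;
      }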
    • xfs: prevent a UAF when log IO errors race with unmount · 7561cea5
      Darrick J. Wong authored
      KASAN reported the following use after free bug when running
      generic/475:
      
       XFS (dm-0): Mounting V5 Filesystem
       XFS (dm-0): Starting recovery (logdev: internal)
       XFS (dm-0): Ending recovery (logdev: internal)
       Buffer I/O error on dev dm-0, logical block 20639616, async page read
       Buffer I/O error on dev dm-0, logical block 20639617, async page read
       XFS (dm-0): log I/O error -5
       XFS (dm-0): Filesystem has been shut down due to log error (0x2).
       XFS (dm-0): Unmounting Filesystem
       XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
       ==================================================================
       BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
       Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136
      
       CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
       Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
       Call Trace:
        <TASK>
        dump_stack_lvl+0x34/0x44
        print_report.cold+0x2b8/0x661
        ? do_raw_spin_lock+0x246/0x270
        kasan_report+0xab/0x120
        ? do_raw_spin_lock+0x246/0x270
        do_raw_spin_lock+0x246/0x270
        ? rwlock_bug.part.0+0x90/0x90
        xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
        xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
        process_one_work+0x672/0x1040
        worker_thread+0x59b/0xec0
        ? __kthread_parkme+0xc6/0x1f0
        ? process_one_work+0x1040/0x1040
        ? process_one_work+0x1040/0x1040
        kthread+0x29e/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
      
       Allocated by task 154099:
        kasan_save_stack+0x1e/0x40
        __kasan_kmalloc+0x81/0xa0
        kmem_alloc+0x8d/0x2e0 [xfs]
        xlog_cil_init+0x1f/0x540 [xfs]
        xlog_alloc_log+0xd1e/0x1260 [xfs]
        xfs_log_mount+0xba/0x640 [xfs]
        xfs_mountfs+0xf2b/0x1d00 [xfs]
        xfs_fs_fill_super+0x10af/0x1910 [xfs]
        get_tree_bdev+0x383/0x670
        vfs_get_tree+0x7d/0x240
        path_mount+0xdb7/0x1890
        __x64_sys_mount+0x1fa/0x270
        do_syscall_64+0x2b/0x80
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
       Freed by task 154151:
        kasan_save_stack+0x1e/0x40
        kasan_set_track+0x21/0x30
        kasan_set_free_info+0x20/0x30
        ____kasan_slab_free+0x110/0x190
        slab_free_freelist_hook+0xab/0x180
        kfree+0xbc/0x310
        xlog_dealloc_log+0x1b/0x2b0 [xfs]
        xfs_unmountfs+0x119/0x200 [xfs]
        xfs_fs_put_super+0x6e/0x2e0 [xfs]
        generic_shutdown_super+0x12b/0x3a0
        kill_block_super+0x95/0xd0
        deactivate_locked_super+0x80/0x130
        cleanup_mnt+0x329/0x4d0
        task_work_run+0xc5/0x160
        exit_to_user_mode_prepare+0xd4/0xe0
        syscall_exit_to_user_mode+0x1d/0x40
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This appears to be a race between the unmount process, which frees the
      CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
      generic/475 runs, it starts fsstress in the background, waits a few
      seconds, and substitutes a dm-error device to simulate a disk falling
      out of a machine.  If the fsstress encounters EIO on a pure data write,
      it will exit but the filesystem will still be online.
      
      The next thing the test does is unmount the filesystem, which tries to
      clean the log, free the CIL, and wait for iclog IO completion.  If an
      iclog was being written when the dm-error switch occurred, it can race
      with log unmounting as follows:
      
      Thread 1				Thread 2
      
      					xfs_log_unmount
      					xfs_log_clean
      					xfs_log_quiesce
      xlog_ioend_work
      <observe error>
      xlog_force_shutdown
      test_and_set_bit(XLOG_IOERROR)
      					xfs_log_force
      					<log is shut down, nop>
      					xfs_log_umount_write
      					<log is shut down, nop>
      					xlog_dealloc_log
      					xlog_cil_destroy
      					<wait for iclogs>
      spin_lock(&log->l_cilp->xc_push_lock)
      <KABOOM>
      
      Therefore, free the CIL after waiting for the iclogs to complete.  I
      /think/ this race has existed for quite a few years now, though I don't
      remember the ~2014 era logging code well enough to know if it was a real
      threat then or if the actual race was exposed only more recently.
      
      Fixes: ac983517 ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      7561cea5
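      A sketch of the teardown ordering that 7561cea5 describes, using invented helper
      names: wait for in-flight iclog IO, whose completion handlers may still take CIL
      locks, before freeing the CIL rather than after.

      #include <stdio.h>

      /* Placeholder log structure and helpers - names are illustrative only. */
      struct xlog_model { int cil_alive; };

      static void wait_for_iclog_io(struct xlog_model *log)
      {
              /* stands in for flushing the ioend workqueue / waiting on iclog refs */
              printf("waiting for in-flight iclog IO (CIL alive: %d)\n", log->cil_alive);
      }

      static void destroy_cil(struct xlog_model *log) { log->cil_alive = 0; }
      static void free_iclogs(struct xlog_model *log) { (void)log; }

      /*
       * Teardown ordering from the fix: iclog IO completion handlers may still
       * take CIL locks (xc_push_lock), so the CIL must outlive all in-flight IO.
       */
      static void dealloc_log_fixed(struct xlog_model *log)
      {
              wait_for_iclog_io(log);  /* 1: no completion work can run after this */
              destroy_cil(log);        /* 2: only now is it safe to free the CIL   */
              free_iclogs(log);        /* 3: tear down the rest of the log         */
      }

      int main(void)
      {
              struct xlog_model log = { .cil_alive = 1 };
              dealloc_log_fixed(&log);
              return 0;
      }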
  5. 27 May 2022, 2 commits
  6. 11 May 2022, 1 commit
  7. 21 Apr 2022, 14 commits
  8. 30 Mar 2022, 5 commits
    • xfs: drop async cache flushes from CIL commits. · 919edbad
      Dave Chinner authored
      Jan Kara reported a performance regression in dbench that he
      bisected down to commit bad77c37 ("xfs: CIL checkpoint
      flushes caches unconditionally").
      
      Whilst developing the journal flush/fua optimisations this cache
      flush was part of, it appeared to make a significant difference to
      performance. However, now that this patchset has settled and all the
      correctness issues fixed, there does not appear to be any
      significant performance benefit to asynchronous cache flushes.
      
      In fact, the opposite is true on some storage types and workloads,
      where additional cache flushes that can occur from fsync heavy
      workloads have measurable and significant impact on overall
      throughput.
      
      Local dbench testing shows little difference on dbench runs with
      sync vs async cache flushes on either fast or slow SSD storage, and
      no difference in streaming concurrent async transaction workloads
      like fs-mark.
      
      Fast NVME storage.
      
      From `dbench -t 30`, CIL scale:
      
      clients		async			sync
      		BW	Latency		BW	Latency
      1		 935.18   0.855		 915.64   0.903
      8		2404.51   6.873		2341.77   6.511
      16		3003.42   6.460		2931.57   6.529
      32		3697.23   7.939		3596.28   7.894
      128		7237.43  15.495		7217.74  11.588
      512		5079.24  90.587		5167.08  95.822
      
      fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
      
      	create		chown		unlink
      async   1m41s		1m16s		2m03s
      sync	1m40s		1m19s		1m54s
      
      Slower SATA SSD storage:
      
      From `dbench -t 30`, CIL scale:
      
      clients		async			sync
      		BW	Latency		BW	Latency
      1		  78.59  15.792		  83.78  10.729
      8		 367.88  92.067		 404.63  59.943
      16		 564.51  72.524		 602.71  76.089
      32		 831.66 105.984		 870.26 110.482
      128		1659.76 102.969		1624.73  91.356
      512		2135.91 223.054		2603.07 161.160
      
      fsmark, 16 threads, create w/32k logbsize
      
      	create		unlink
      async   5m06s		4m15s
      sync	5m00s		4m22s
      
      And on Jan's test machine:
      
                         5.18-rc8-vanilla       5.18-rc8-patched
      Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
      Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
      Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
      Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
      Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
      Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
      Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*
      
      Performance and cache flush behaviour is restored to pre-regression
      levels.
      
      As such, we can now consider the async cache flush mechanism an
      unnecessary exercise in premature optimisation and hence we can
      now remove it and the infrastructure it requires completely.
      
      Fixes: bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally")
      Reported-and-tested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      919edbad
    • xfs: shutdown during log recovery needs to mark the log shutdown · 5652ef31
      Dave Chinner authored
      When a checkpoint writeback is run by log recovery, corruption
      propagated from the log can result in writeback verifiers failing
      and calling xfs_force_shutdown() from
      xfs_buf_delwri_submit_buffers().
      
      This results in the mount being marked as shutdown, but the log does
      not get marked as shut down because:
      
              /*
               * If this happens during log recovery then we aren't using the runtime
               * log mechanisms yet so there's nothing to shut down.
               */
              if (!log || xlog_in_recovery(log))
                      return false;
      
      If there are other buffers that then fail (say due to detecting the
      mount shutdown), they will now hang in xfs_do_force_shutdown()
      waiting for the log to shut down like this:
      
        __schedule+0x30d/0x9e0
        schedule+0x55/0xd0
        xfs_do_force_shutdown+0x1cd/0x200
        ? init_wait_var_entry+0x50/0x50
        xfs_buf_ioend+0x47e/0x530
        __xfs_buf_submit+0xb0/0x240
        xfs_buf_delwri_submit_buffers+0xfe/0x270
        xfs_buf_delwri_submit+0x3a/0xc0
        xlog_do_recovery_pass+0x474/0x7b0
        ? do_raw_spin_unlock+0x30/0xb0
        xlog_do_log_recovery+0x91/0x140
        xlog_do_recover+0x38/0x1e0
        xlog_recover+0xdd/0x170
        xfs_log_mount+0x17e/0x2e0
        xfs_mountfs+0x457/0x930
        xfs_fs_fill_super+0x476/0x830
      
      xlog_force_shutdown() always needs to mark the log as shut down,
      regardless of whether recovery is in progress or not, so that
      multiple calls to xfs_force_shutdown() during recovery don't end
      up waiting for the log to be shut down like this.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      5652ef31
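      A sketch of the behavioural change in 5652ef31, with simplified state flags: the
      shutdown path still skips the runtime log machinery while recovery is running, but
      it now always marks the log as shut down so later shutdown callers don't wait
      forever for a state that never arrives. Names and flags are illustrative.

      #include <stdbool.h>
      #include <stdio.h>

      /* Simplified log state - the real code uses opstate bit flags. */
      struct xlog_model {
              bool in_recovery;
              bool shutdown;          /* what xlog_is_shutdown() would test */
      };

      /* Returns true if this call performed the shutdown transition. */
      static bool log_force_shutdown(struct xlog_model *log)
      {
              if (!log)
                      return false;

              /*
               * Old behaviour: bail out entirely during recovery, leaving the log
               * not marked as shut down. New behaviour: always set the flag, and
               * only skip the runtime shutdown work (AIL, CIL, waiters) when the
               * runtime machinery isn't up yet.
               */
              bool already = log->shutdown;
              log->shutdown = true;

              if (log->in_recovery)
                      return !already;        /* flag set, runtime teardown skipped */

              /* ... runtime shutdown work would go here ... */
              return !already;
      }

      int main(void)
      {
              struct xlog_model log = { .in_recovery = true, .shutdown = false };
              log_force_shutdown(&log);
              /* a second shutdown caller now sees the log as shut down and won't hang */
              printf("log shutdown: %d\n", log.shutdown);
              return 0;
      }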
    • xfs: xfs_do_force_shutdown needs to block racing shutdowns · 41e63621
      Dave Chinner authored
      When we call xfs_forced_shutdown(), the caller often expects the
      filesystem to be completely shut down when it returns. However,
      if we have racing xfs_forced_shutdown() calls, the first caller sets
      the mount shutdown flag then goes to shutdown the log. The second
      caller sees the mount shutdown flag and returns immediately - it
      does not wait for the log to be shut down.
      
      Unfortunately, xfs_forced_shutdown() is used in some places that
      expect it to completely shut down the filesystem before it returns
      (e.g. xfs_trans_log_inode()). As such, returning before the log has
      been shut down leaves us in a place where the transaction failed to
      complete correctly but we still call xfs_trans_commit(). This
      situation arises because xfs_trans_log_inode() does not return an
      error and instead calls xfs_force_shutdown() to ensure that the
      transaction being committed is aborted.
      
      Unfortunately, we have a race condition where xfs_trans_commit()
      needs to check xlog_is_shutdown() because it can't abort log items
      before the log is shut down, but it needs to use xfs_is_shutdown()
      because xfs_forced_shutdown() does not block waiting for the log to
      shut down.
      
      To fix this conundrum, first we make all calls to
      xfs_forced_shutdown() block until the log is also shut down. This
      means we can then safely use xfs_forced_shutdown() as a mechanism
      that ensures the currently running transaction will be aborted by
      xfs_trans_commit() regardless of the shutdown check it uses.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      41e63621
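      A sketch of the "block until the log is shut down" behaviour from 41e63621,
      modelled with a pthread mutex and condition variable rather than the kernel's wait
      machinery: the caller that loses the race no longer returns until the winner has
      shut the log down. All names and types are invented for illustration.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Simplified mount/log shutdown state - illustrative only. */
      struct fs_model {
              pthread_mutex_t lock;
              pthread_cond_t  log_shut_down;
              bool mount_shutdown;
              bool log_shutdown;
      };

      static void log_force_shutdown(struct fs_model *fs)
      {
              /* stands in for shutting down the log: may take a while */
              pthread_mutex_lock(&fs->lock);
              fs->log_shutdown = true;
              pthread_cond_broadcast(&fs->log_shut_down);
              pthread_mutex_unlock(&fs->lock);
      }

      static void forced_shutdown(struct fs_model *fs)
      {
              pthread_mutex_lock(&fs->lock);
              if (fs->mount_shutdown) {
                      /*
                       * We lost the race. Previously we returned immediately here,
                       * possibly before the log was shut down. Now we wait, so callers
                       * like xfs_trans_log_inode() can rely on the log being down when
                       * this returns.
                       */
                      while (!fs->log_shutdown)
                              pthread_cond_wait(&fs->log_shut_down, &fs->lock);
                      pthread_mutex_unlock(&fs->lock);
                      return;
              }
              fs->mount_shutdown = true;
              pthread_mutex_unlock(&fs->lock);

              log_force_shutdown(fs);
      }

      int main(void)
      {
              struct fs_model fs = { PTHREAD_MUTEX_INITIALIZER,
                                     PTHREAD_COND_INITIALIZER, false, false };
              forced_shutdown(&fs);   /* single-threaded demo of the winning path */
              printf("log shutdown: %d\n", fs.log_shutdown);
              return 0;
      }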
    • xfs: log shutdown triggers should only shut down the log · b5f17bec
      Dave Chinner authored
      We've got a mess on our hands.
      
      1. xfs_trans_commit() cannot cancel transactions because the mount is
      shut down - that causes dirty, aborted, unlogged log items to sit
      unpinned in memory and potentially get written to disk before the
      log is shut down. Hence xfs_trans_commit() can only abort
      transactions when xlog_is_shutdown() is true.
      
      2. xfs_force_shutdown() is used in places to cause the current
      modification to be aborted via xfs_trans_commit() because it may be
      impractical or impossible to cancel the transaction directly, and
      hence xfs_trans_commit() must cancel transactions when
      xfs_is_shutdown() is true in this situation. But we can't do that
      because of #1.
      
      3. Log IO errors cause log shutdowns by calling xfs_force_shutdown()
      to shut down the mount and then the log from log IO completion.
      
      4. xfs_force_shutdown() can result in a log force being issued,
      which has to wait for log IO completion before it will mark the log
      as shut down. If #3 races with some other shutdown trigger that runs
      a log force, we rely on xfs_force_shutdown() silently ignoring #3
      and avoiding shutting down the log until the failed log force
      completes.
      
      5. To ensure #2 always works, we have to ensure that
      xfs_force_shutdown() does not return until the log is shut down.
      But in the case of #4, this will result in a deadlock because the
      log IO completion will block waiting for a log force to complete,
      which is itself blocked waiting for log IO to complete....
      
      So the very first thing we have to do here to untangle this mess is
      dissociate log shutdown triggers from mount shutdowns. We already
      have xlog_force_shutdown(), which will atomically transition the
      log to a shutdown state. Due to internal asserts it cannot be called
      multiple times, but this was done simply because the only place that
      could call it was xfs_do_force_shutdown() (i.e. the mount shutdown!)
      and that could only call it once and once only.  So the first thing
      we do is remove the asserts.
      
      We then convert all the internal log shutdown triggers to call
      xlog_force_shutdown() directly instead of xfs_force_shutdown(). This
      allows the log shutdown triggers to shut down the log without
      needing to care about mount based shutdown constraints. This means
      we shut down the log independently of the mount and the mount may
      not notice this until its next attempt to read or modify metadata.
      At that point (e.g. xfs_trans_commit()) it will see that the log is
      shutdown, error out and shutdown the mount.
      
      To ensure that all the unmount behaviours and asserts track
      correctly as a result of a log shutdown, propagate the shutdown up
      to the mount if it is not already set. This keeps the mount and log
      state in sync, and saves a huge amount of hassle where code fails
      because of a log shutdown but only checks for mount shutdowns and
      hence ends up doing the wrong thing. Cleaning up that mess is
      an exercise for another day.
      
      This enables us to address the other problems noted above in
      followup patches.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      b5f17bec
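      A sketch of the direction of propagation described in b5f17bec, using simplified
      flags: a log-internal trigger shuts the log down directly, and the log shutdown is
      then propagated up to the mount so higher-level checks stay in sync. The types are
      invented for illustration.

      #include <stdbool.h>
      #include <stdio.h>

      /* Simplified, illustrative stand-ins for the mount and log state. */
      struct mount_model { bool shutdown; };
      struct xlog_model  { struct mount_model *mp; bool shutdown; };

      /*
       * Log IO errors and other log-internal triggers call this directly rather
       * than going through the mount shutdown path. The log shuts itself down
       * first, then the state is propagated up to the mount if needed.
       */
      static void log_force_shutdown(struct xlog_model *log)
      {
              if (log->shutdown)
                      return;                 /* may now be called more than once */
              log->shutdown = true;

              if (!log->mp->shutdown)
                      log->mp->shutdown = true;  /* keep mount and log state in sync */
      }

      int main(void)
      {
              struct mount_model mp = { .shutdown = false };
              struct xlog_model log = { .mp = &mp, .shutdown = false };

              log_force_shutdown(&log);       /* e.g. triggered by a log IO error */
              printf("log: %d mount: %d\n", log.shutdown, mp.shutdown);
              return 0;
      }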
    • xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks · cd6f79d1
      Dave Chinner authored
      Brian reported a null pointer dereference failure during unmount in
      xfs/006. He tracked the problem down to the AIL being torn down
      before a log shutdown had completed and removed all the items from
      the AIL. The failure occurred in this path while unmount was
      proceeding in another task:
      
       xfs_trans_ail_delete+0x102/0x130 [xfs]
       xfs_buf_item_done+0x22/0x30 [xfs]
       xfs_buf_ioend+0x73/0x4d0 [xfs]
       xfs_trans_committed_bulk+0x17e/0x2f0 [xfs]
       xlog_cil_committed+0x2a9/0x300 [xfs]
       xlog_cil_process_committed+0x69/0x80 [xfs]
       xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
       xlog_force_shutdown+0xdf/0x150 [xfs]
       xfs_do_force_shutdown+0x5f/0x150 [xfs]
       xlog_ioend_work+0x71/0x80 [xfs]
       process_one_work+0x1c5/0x390
       worker_thread+0x30/0x350
       kthread+0xd7/0x100
       ret_from_fork+0x1f/0x30
      
      This is processing an EIO error to a log write, and it's
      triggering a force shutdown. This causes the log to be shut down,
      and then it is running attached iclog callbacks from the shutdown
      context. That means the fs and log has already been marked as
      xfs_is_shutdown/xlog_is_shutdown and so high level code will abort
      (e.g. xfs_trans_commit(), xfs_log_force(), etc) with an error
      because of shutdown.
      
      The umount would have been blocked waiting for a log force
      completion inside xfs_log_cover() -> xfs_sync_sb(). For this
      situation to occur, xfs_sync_sb() must have exited without waiting
      for the iclog buffer to be committed to disk. The above trace is the
      completion routine for that iclog buffer, and it is shutting down
      the filesystem.
      
      xlog_state_shutdown_callbacks() does this:
      
      {
              struct xlog_in_core     *iclog;
              LIST_HEAD(cb_list);
      
              spin_lock(&log->l_icloglock);
              iclog = log->l_iclog;
              do {
                      if (atomic_read(&iclog->ic_refcnt)) {
                              /* Reference holder will re-run iclog callbacks. */
                              continue;
                      }
                      list_splice_init(&iclog->ic_callbacks, &cb_list);
      >>>>>>           wake_up_all(&iclog->ic_write_wait);
      >>>>>>           wake_up_all(&iclog->ic_force_wait);
              } while ((iclog = iclog->ic_next) != log->l_iclog);
      
              wake_up_all(&log->l_flush_wait);
              spin_unlock(&log->l_icloglock);
      
      >>>>>>  xlog_cil_process_committed(&cb_list);
      }
      
      This wakes any thread waiting on IO completion of the iclog (in this
      case the umount log force) before shutdown processes all the pending
      callbacks.  That means the xfs_sync_sb() waiting on a sync
      transaction in xfs_log_force() on iclog->ic_force_wait will get
      woken before the callbacks attached to that iclog are run. This
      results in xfs_sync_sb() returning an error, and so unmount unblocks
      and continues to run whilst the log shutdown is still in progress.
      
      Normally this is just fine because the force waiter has nothing to
      do with AIL operations. But in the case of this unmount path, the
      log force waiter goes on to tear down the AIL because the log is now
      shut down and so nothing ever blocks it again from the wait point in
      xfs_log_cover().
      
      Hence it's a race to see who gets to the AIL first - the unmount
      code or xlog_cil_process_committed() killing the superblock buffer.
      
      To fix this, we just have to change the order of processing in
      xlog_state_shutdown_callbacks() to run the callbacks before it wakes
      any task waiting on completion of the iclog.
      Reported-by: Brian Foster <bfoster@redhat.com>
      Fixes: aad7272a ("xfs: separate out log shutdown callback processing")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      cd6f79d1
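      A sketch of the reordering from cd6f79d1, in the same shape as the snippet quoted
      above but with heavily simplified stand-in types: the spliced-off callbacks are
      processed before any iclog waiters are woken, so an unmount log-force waiter can no
      longer run ahead of shutdown callback processing.

      #include <stdio.h>

      /* Heavily simplified stand-ins; the real code walks iclogs, splices
       * their ic_callbacks lists under l_icloglock and uses waitqueues. */
      struct cb_list_model { int pending; };

      static void splice_iclog_callbacks(struct cb_list_model *cb) { cb->pending = 1; }

      static void process_committed(struct cb_list_model *cb)
      {
              /* stands in for processing committed callbacks: abort/complete items */
              cb->pending = 0;
      }

      static void wake_iclog_waiters(void)
      {
              printf("waking ic_write_wait/ic_force_wait waiters\n");
      }

      /*
       * Fixed ordering: run the spliced-off callbacks first, then wake anyone
       * waiting on the iclogs. A waiter such as the unmount log force can then
       * only proceed after the shutdown callbacks have been processed.
       */
      static void state_shutdown_callbacks(struct cb_list_model *cb)
      {
              splice_iclog_callbacks(cb);     /* done under the icloglock in XFS */
              process_committed(cb);          /* previously ran after the wakeups */
              wake_iclog_waiters();
      }

      int main(void)
      {
              struct cb_list_model cb = { 0 };
              state_shutdown_callbacks(&cb);
              return 0;
      }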
  9. 20 Mar 2022, 2 commits
    • xfs: log items should have a xlog pointer, not a mount · d86142dd
      Dave Chinner authored
      Log items belong to the log, not the xfs_mount. Convert the mount
      pointer in the log item to an xlog pointer in preparation for
      upcoming log centric changes to the log items.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      d86142dd
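      A minimal before/after sketch of the structural change in d86142dd, with invented
      type names: the log item carries a pointer to the log, and anything that still
      needs the mount reaches it through the log.

      #include <stdio.h>

      /* Simplified, illustrative structures - not the real XFS definitions. */
      struct xfs_mount_model { int m_id; };

      struct xlog_model {
              struct xfs_mount_model *l_mp;   /* the log still knows its mount */
      };

      /*
       * Before: struct log_item { struct xfs_mount_model *li_mountp; ... };
       * After:  the item points at the log, which is what log-centric code wants;
       *         code that still needs the mount reaches it via the log.
       */
      struct log_item_model {
              struct xlog_model *li_log;
      };

      int main(void)
      {
              struct xfs_mount_model mp = { .m_id = 1 };
              struct xlog_model log = { .l_mp = &mp };
              struct log_item_model lip = { .li_log = &log };

              printf("mount id via log item: %d\n", lip.li_log->l_mp->m_id);
              return 0;
      }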
    • xfs: log worker needs to start before intent/unlink recovery · a9a4bc8c
      Dave Chinner authored
      After 963 iterations of generic/530, it deadlocked during recovery
      on a pinned inode cluster buffer like so:
      
      XFS (pmem1): Starting recovery (logdev: internal)
      INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
            Not tainted 5.17.0-rc6-dgc+ #975
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:kworker/8:0     state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
      Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       schedule_timeout+0x114/0x160
       __down+0x99/0xf0
       down+0x5e/0x70
       xfs_buf_lock+0x36/0xf0
       xfs_buf_find+0x418/0x850
       xfs_buf_get_map+0x47/0x380
       xfs_buf_read_map+0x54/0x240
       xfs_trans_read_buf_map+0x1bd/0x490
       xfs_imap_to_bp+0x4f/0x70
       xfs_iunlink_map_ino+0x66/0xd0
       xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
       xfs_iunlink_remove_inode+0xf2/0x1d0
       xfs_inactive_ifree+0x1a3/0x900
       xfs_inode_unlink+0xcc/0x210
       xfs_inodegc_worker+0x1ac/0x2f0
       process_one_work+0x1ac/0x390
       worker_thread+0x56/0x3c0
       kthread+0xf6/0x120
       ret_from_fork+0x1f/0x30
       </TASK>
      task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       schedule_timeout+0x114/0x160
       __down+0x99/0xf0
       down+0x5e/0x70
       xfs_buf_lock+0x36/0xf0
       xfs_buf_find+0x418/0x850
       xfs_buf_get_map+0x47/0x380
       xfs_buf_read_map+0x54/0x240
       xfs_trans_read_buf_map+0x1bd/0x490
       xfs_imap_to_bp+0x4f/0x70
       xfs_iget+0x300/0xb40
       xlog_recover_process_one_iunlink+0x4c/0x170
       xlog_recover_process_iunlinks.isra.0+0xee/0x130
       xlog_recover_finish+0x57/0x110
       xfs_log_mount_finish+0xfc/0x1e0
       xfs_mountfs+0x540/0x910
       xfs_fs_fill_super+0x495/0x850
       get_tree_bdev+0x171/0x270
       xfs_fs_get_tree+0x15/0x20
       vfs_get_tree+0x24/0xc0
       path_mount+0x304/0xba0
       __x64_sys_mount+0x108/0x140
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
       </TASK>
      task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       io_schedule+0x4b/0x80
       xfs_buf_wait_unpin+0x9e/0xf0
       __xfs_buf_submit+0x14a/0x230
       xfs_buf_delwri_submit_buffers+0x107/0x280
       xfs_buf_delwri_submit_nowait+0x10/0x20
       xfsaild+0x27e/0x9d0
       kthread+0xf6/0x120
       ret_from_fork+0x1f/0x30
      
      We have the mount process waiting on an inode cluster buffer read,
      inodegc doing unlink waiting on the same inode cluster buffer, and
      the AIL push thread blocked in writeback waiting for the inode
      cluster buffer to become unpinned.
      
      What has happened here is that the AIL push thread has raced with
      the inodegc process modifying, committing and pinning the inode
      cluster buffer in xfs_buf_delwri_submit_buffers() here:
      
      	blk_start_plug(&plug);
      	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
      		if (!wait_list) {
      			if (xfs_buf_ispinned(bp)) {
      				pinned++;
      				continue;
      			}
      Here >>>>>>
      			if (!xfs_buf_trylock(bp))
      				continue;
      
      Basically, the AIL has found the buffer wasn't pinned and got the
      lock without blocking, but then the buffer was pinned. This implies
      the processing here was pre-empted between the pin check and the
      lock, because the pin count can only be increased while holding the
      buffer locked. Hence when it has gone to submit the IO, it has
      blocked waiting for the buffer to be unpinned.
      
      With all executing threads now waiting on the buffer to be unpinned,
      we normally get out of situations like this via the background log
      worker issuing a log force which will unpin stuck buffers like
      this. But at this point in recovery, we haven't started the log
      worker. In fact, the first thing we do after processing intents and
      unlinked inodes is *start the log worker*. IOWs, we start it too
      late to have it break deadlocks like this.
      
      Avoid this and any other similar deadlock vectors in intent and
      unlinked inode recovery by starting the log worker before we recover
      intents and unlinked inodes. This part of recovery runs as though
      the filesystem is fully active, so we really should have the same
      infrastructure running as we normally do at runtime.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      a9a4bc8c
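      A sketch of the ordering change in a9a4bc8c, with invented helper names: the
      periodic log worker is started before intent and unlinked-inode recovery, so a
      background log force is available to unpin buffers if that phase of recovery gets
      stuck.

      #include <stdio.h>

      /* Placeholder steps - the names only roughly mirror the recovery sequence. */
      static void start_log_worker(void)        { printf("log worker running\n"); }
      static void recover_intents(void)         { printf("replaying intents\n"); }
      static void recover_unlinked_inodes(void) { printf("processing unlinked list\n"); }

      /*
       * Old order: intents and unlinked inodes were recovered first and the log
       * worker only started afterwards, so nothing could issue the periodic log
       * force needed to unpin a stuck inode cluster buffer.
       */
      static void log_mount_finish_fixed(void)
      {
              start_log_worker();             /* now running while recovery proceeds */
              recover_intents();              /* runs as if the fs were fully active */
              recover_unlinked_inodes();
      }

      int main(void)
      {
              log_mount_finish_fixed();
              return 0;
      }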
  10. 02 Feb 2022, 1 commit
  11. 23 Oct 2021, 2 commits
  12. 20 Aug 2021, 3 commits
  13. 17 Aug 2021, 4 commits
    • xfs: AIL needs asynchronous CIL forcing · 0020a190
      Dave Chinner authored
      The AIL pushing is stalling on log forces when it comes across
      pinned items. This is happening on removal workloads where the AIL
      is dominated by stale items that are removed from AIL when the
      checkpoint that marks the items stale is committed to the journal.
      This results in relatively few items in the AIL, but those that
      remain are often pinned, as the directories the items are being
      removed from are still being logged.
      
      As a result, many push cycles through the CIL will first issue a
      blocking log force to unpin the items. This can take some time to
      complete, with tracing regularly showing push delays of half a
      second and sometimes up into the range of several seconds. Sequences
      like this aren't uncommon:
      
      ....
       399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 270ms delay>
       400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
       400.099623:  xfsaild: first lsn 0x11002f3600
       400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
      <wanted 50ms, got 500ms delay>
       400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
       400.589349:  xfsaild: first lsn 0x1100305000
       400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
      <wanted 50ms, got 460ms delay>
       400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
       400.950343:  xfsaild: first lsn 0x1100317c00
       400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 200ms delay>
       401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
       401.142334:  xfsaild: first lsn 0x110032e600
       401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
      <wanted 10ms, got 10ms delay>
       401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
       401.154328:  xfsaild: first lsn 0x1100353000
       401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 300ms delay>
       401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
       401.451526:  xfsaild: first lsn 0x1100353000
       401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
      <wanted 50ms, got 500ms delay>
       401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
      ....
      
      In each of these cases, every AIL pass saw 101 log items stuck on
      the AIL (pinned) with very few other items being found. Each pass, a
      log force was issued, and delay between last/first is the sleep time
      + the sync log force time.
      
      Some of these 101 items pinned the tail of the log. The tail of the
      log does slowly creep forward (first lsn), but the problem is that
      the log is actually out of reservation space because it's been
      running so many transactions that generate stale items which never
      reach the AIL but still consume log space. Hence we have a largely
      empty AIL, with
      long term pins on items that pin the tail of the log that don't get
      pushed frequently enough to keep log space available.
      
      The problem is the hundreds of milliseconds that we block in the log
      force pushing the CIL out to disk. The AIL should not be stalled
      like this - it needs to run and flush items that are at the tail of
      the log with minimal latency. What we really need to do is trigger a
      log flush, but then not wait for it at all - we've already done our
      waiting for stuff to complete when we backed off prior to the log
      force being issued.
      
      Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
      still do a blocking flush of the CIL and that is what is causing the
      issue. Hence we need a new interface for the CIL to trigger an
      immediate background push of the CIL to get it moving faster but not
      to wait on that to occur. While the CIL is pushing, the AIL can also
      be pushing.
      
      We already have an internal interface to do this -
      xlog_cil_push_now() - but we need a wrapper for it to be used
      externally. xlog_cil_force_seq() can easily be extended to do what
      we need as it already implements the synchronous CIL push via
      xlog_cil_push_now(). Add the necessary flags and "push current
      sequence" semantics to xlog_cil_force_seq() and convert the AIL
      pushing to use it.
      
      One of the complexities here is that the CIL push does not guarantee
      that the commit record for the CIL checkpoint is written to disk.
      The current log force ensures this by submitting the current ACTIVE
      iclog that the commit record was written to. We need the CIL to
      actually write this commit record to disk for an async push to
      ensure that the checkpoint actually makes it to disk and unpins the
      pinned items in the checkpoint on completion. Hence we need to pass
      down to the CIL push that we are doing an async flush so that it can
      switch out the commit_iclog if necessary to get written to disk when
      the commit iclog is finally released.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      0020a190
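      A sketch of the asynchronous push interface described in 0020a190, with simplified
      names and types: the AIL asks for the current CIL sequence to be pushed, including
      getting the commit record to disk, but does not wait for the push to complete.

      #include <stdbool.h>
      #include <stdio.h>

      /* Illustrative CIL model - the real code deals in log sequence numbers. */
      struct cil_model {
              unsigned long current_seq;
      };

      /*
       * Kick a push of everything up to 'seq'. When 'async' is set the caller
       * will not wait, so the push must also get the commit record written
       * (switching out the commit iclog if needed) to guarantee the checkpoint
       * reaches disk and unpins its items.
       */
      static void cil_push_now(struct cil_model *cil, unsigned long seq, bool async)
      {
              (void)cil;      /* the real code queues background push work here */
              printf("pushing CIL seq %lu (%s)\n", seq, async ? "async" : "sync");
      }

      /* What the AIL calls instead of a blocking synchronous log force. */
      static void cil_flush_async(struct cil_model *cil)
      {
              cil_push_now(cil, cil->current_seq, true);
              /* return immediately - the AIL keeps pushing while the CIL flushes */
      }

      int main(void)
      {
              struct cil_model cil = { .current_seq = 42 };
              cil_flush_async(&cil);
              return 0;
      }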
    • xfs: order CIL checkpoint start records · 68a74dca
      Dave Chinner authored
      Log recovery depends on strictly ordered start records as well as
      strictly ordered commit records, so checkpoint start records need
      to be written in strict order too.
      
      This is a zero day bug in the way XFS writes pipelined transactions
      to the journal which is exposed by fixing the zero day bug that
      prevents the CIL from pipelining checkpoints. This re-introduces
      explicit concurrent commits back into the on-disk journal and hence
      out of order start records.
      
      The XFS journal commit code has never ordered start records and we
      have relied on strict commit record ordering for correct recovery
      ordering of concurrently written transactions. Unfortunately, root
      cause analysis uncovered the fact that log recovery uses the LSN of
      the start record for transaction commit processing. Hence, whilst
      the commits are processed in strict order by recovery, the LSNs
      associated with the commits can be out of order and so recovery may
      stamp incorrect LSNs into objects and/or misorder intents in the AIL
      for later processing. This can result in log recovery failures
      and/or on disk corruption, sometimes silent.
      
      Because this is a long standing log recovery issue, we can't just
      fix log recovery and call it good. This still leaves older kernels
      susceptible to recovery failures and corruption when replaying a log
      from a kernel that pipelines checkpoints. There is also the issue
      that in-memory ordering for AIL pushing and data integrity
      operations is based on checkpoint start LSNs, and if the start LSN
      is incorrect in the journal, it is also incorrect in memory.
      
      Hence there's really only one choice for fixing this zero-day bug:
      we need to strictly order checkpoint start records in ascending
      sequence order in the log, the same way we already strictly order
      commit records.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      68a74dca
    • xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state() · caa80090
      Dave Chinner authored
      Now that we have a mechanism to guarantee that the callbacks
      attached to an iclog are owned by the context that attaches them
      until they drop their reference to the iclog via
      xlog_state_release_iclog(), we can attach callbacks to the iclog at
      any time we have an active reference to the iclog.
      
      xlog_state_get_iclog_space() always guarantees that the commit
      record will fit in the iclog it returns, so we can move this IO
      callback setting to xlog_cil_set_ctx_write_state(), record the
      commit iclog in the context and remove the need for the commit iclog
      to be returned by xlog_write() altogether.
      
      This, in turn, allows us to move the wakeup for ordered commit
      record writes up into xlog_cil_set_ctx_write_state(), too, because
      we have been guaranteed that this commit record will be physically
      located in the iclog before any waiting commit record at a higher
      sequence number will be granted iclog space.
      
      This further cleans up the post commit record write processing in
      the CIL push code, especially as xlog_state_release_iclog() will now
      clean up the context when shutdown errors occur.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      caa80090
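      A sketch of the change in caa80090, with simplified stand-in types: the CIL context
      records the commit iclog and attaches the completion callback at the point the
      commit record is written, so xlog_write() no longer needs to hand an iclog back to
      its caller.

      #include <stddef.h>
      #include <stdio.h>

      /* Illustrative models only - not the real iclog/context layouts. */
      struct iclog_model {
              int cb_count;                   /* stands in for the ic_callbacks list */
      };

      struct cil_ctx_model {
              struct iclog_model *commit_iclog;   /* recorded here instead of being
                                                   * returned by the write path */
              unsigned long commit_lsn;
      };

      /*
       * Called when the commit record has been written into an iclog. We hold a
       * reference to the iclog at this point, so it is safe to attach the
       * checkpoint completion callback and remember the iclog in the context.
       */
      static void set_ctx_write_state(struct cil_ctx_model *ctx,
                                      struct iclog_model *iclog,
                                      unsigned long lsn)
      {
              ctx->commit_lsn = lsn;
              ctx->commit_iclog = iclog;
              iclog->cb_count++;              /* attach the completion callback */
              /* the ordered commit-record wakeup can also be issued from here */
      }

      int main(void)
      {
              struct iclog_model ic = { 0 };
              struct cil_ctx_model ctx = { NULL, 0 };

              set_ctx_write_state(&ctx, &ic, 100);
              printf("commit iclog callbacks: %d\n", ic.cb_count);
              return 0;
      }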
    • xfs: pass a CIL context to xlog_write() · c45aba40
      Dave Chinner authored
      Pass the CIL context to xlog_write() rather than a pointer to a LSN
      variable. Only the CIL checkpoint calls to xlog_write() need to know
      about the start LSN of the writes, so rework xlog_write to directly
      write the LSNs into the CIL context structure.
      
      This removes the commit_lsn variable from xlog_cil_push_work(), so
      now we only have to issue the commit record ordering wakeup from
      there.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      c45aba40
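      A sketch of the interface change in c45aba40, with simplified types: the write path
      takes the CIL context and records the checkpoint start LSN into it directly,
      instead of returning it through a caller-supplied LSN pointer.

      #include <stdio.h>

      /* Illustrative context - only the field relevant to this change. */
      struct cil_ctx_model {
              unsigned long start_lsn;
      };

      /*
       * Before: int log_write(..., unsigned long *start_lsn);
       * After:  the context comes in and the write path fills in the LSN it
       *         knows about; callers that don't care simply pass no context.
       */
      static int log_write(struct cil_ctx_model *ctx, unsigned long first_lsn)
      {
              if (ctx)
                      ctx->start_lsn = first_lsn;   /* recorded once, where it is known */
              /* ... format and copy the log vectors into iclogs ... */
              return 0;
      }

      int main(void)
      {
              struct cil_ctx_model ctx = { 0 };
              log_write(&ctx, 0x1100353000UL);
              printf("checkpoint start lsn: 0x%lx\n", ctx.start_lsn);
              return 0;
      }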