1. 20 Aug, 2021 8 commits
  2. 19 Aug, 2021 17 commits
  3. 17 Aug, 2021 15 commits
    • D
      xfs: move the CIL workqueue to the CIL · 33c0dd78
      Committed by Dave Chinner
      We only use the CIL workqueue in the CIL, so it makes no sense to
      hang it off the xfs_mount and have to walk multiple pointers back up
      to the mount when we have the CIL structures right there.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      33c0dd78
    • D
      xfs: CIL work is serialised, not pipelined · 39823d0f
      Committed by Dave Chinner
      Because we use a single work structure attached to the CIL rather
      than the CIL context, we can only queue a single work item at a
      time. This results in the CIL being single threaded and limits
      performance when it becomes CPU bound.
      
      The design of the CIL is that it is pipelined and multiple commits
      can be running concurrently, but the way the work is currently
      implemented means that it is not pipelining as it was intended. The
      critical work to switch the CIL context can take a few milliseconds
      to run, but the rest of the CIL context flush can take hundreds of
      milliseconds to complete. The context switching is the serialisation
      point of the CIL; once the context has been switched, the rest of the
      context push can run asynchronously with all other context pushes.
      
      Hence we can move the work to the CIL context so that we can run
      multiple CIL pushes at the same time and spread the majority of
      the work out over multiple CPUs. We can keep the per-cpu CIL commit
      state on the CIL rather than the context, because the context is
      pinned to the CIL until the switch is done and we aggregate and
      drain the per-cpu state held on the CIL during the context switch.
      
      However, because we no longer serialise the CIL work, we can have
      effectively unlimited CIL pushes in progress. We don't want to do
      this - not only does it create contention on the iclogs and the
      state machine locks, we can run the log right out of space with
      outstanding pushes. Instead, limit the work concurrency to 4
      concurrent works being processed at a time. This is enough
      concurrency to remove the CIL from being a CPU bound bottleneck but
      not enough to create new contention points or unbound concurrency
      issues.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      39823d0f
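The concurrency cap described in this commit can be pictured with a small userspace model. This is only a sketch: `push_limiter`, `push_start()`, `push_done()` and `CIL_MAX_ACTIVE_PUSHES` are hypothetical names invented for illustration, not kernel symbols, and the model ignores locking entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constant mirroring the "4 concurrent works" in the message. */
#define CIL_MAX_ACTIVE_PUSHES 4

/* Hypothetical bookkeeping for in-flight and waiting push work. */
struct push_limiter {
	int active;	/* pushes currently being processed */
	int queued;	/* pushes waiting for a free slot */
};

/* Try to start a push; if every slot is busy the push is queued instead. */
static bool push_start(struct push_limiter *pl)
{
	if (pl->active < CIL_MAX_ACTIVE_PUSHES) {
		pl->active++;
		return true;
	}
	pl->queued++;
	return false;
}

/* A running push finished; hand its slot to a queued push if there is one. */
static void push_done(struct push_limiter *pl)
{
	assert(pl->active > 0);
	pl->active--;
	if (pl->queued > 0) {
		pl->queued--;
		pl->active++;
	}
}
```

In the kernel, a limit like this is commonly expressed through the `max_active` argument of `alloc_workqueue()`; the message above does not say which mechanism the patch itself uses.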
    • D
      xfs: AIL needs asynchronous CIL forcing · 0020a190
      Committed by Dave Chinner
      The AIL pushing is stalling on log forces when it comes across
      pinned items. This is happening on removal workloads where the AIL
      is dominated by stale items that are removed from AIL when the
      checkpoint that marks the items stale is committed to the journal.
      This results in relatively few items in the AIL, but those that are
      there are often pinned, as the directories that items are being
      removed from are still being logged.
      
      As a result, many push cycles through the CIL will first issue a
      blocking log force to unpin the items. This can take some time to
      complete, with tracing regularly showing push delays of half a
      second and sometimes up into the range of several seconds. Sequences
      like this aren't uncommon:
      
      ....
       399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 270ms delay>
       400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
       400.099623:  xfsaild: first lsn 0x11002f3600
       400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
      <wanted 50ms, got 500ms delay>
       400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
       400.589349:  xfsaild: first lsn 0x1100305000
       400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
      <wanted 50ms, got 460ms delay>
       400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
       400.950343:  xfsaild: first lsn 0x1100317c00
       400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 200ms delay>
       401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
       401.142334:  xfsaild: first lsn 0x110032e600
       401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
      <wanted 10ms, got 10ms delay>
       401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
       401.154328:  xfsaild: first lsn 0x1100353000
       401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 300ms delay>
       401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
       401.451526:  xfsaild: first lsn 0x1100353000
       401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
      <wanted 50ms, got 500ms delay>
       401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
      ....
      
      In each of these cases, every AIL pass saw 101 log items stuck on
      the AIL (pinned) with very few other items being found. Each pass, a
      log force was issued, and delay between last/first is the sleep time
      + the sync log force time.
      
      Some of these 101 items pinned the tail of the log. The tail of the
      log does slowly creep forward (first lsn), but the problem is that
      the log is actually out of reservation space because it's been
      running so many transactions that create stale items, which never
      reach the AIL but still consume log space. Hence we have a largely empty AIL, with
      long term pins on items that pin the tail of the log that don't get
      pushed frequently enough to keep log space available.
      
      The problem is the hundreds of milliseconds that we block in the log
      force pushing the CIL out to disk. The AIL should not be stalled
      like this - it needs to run and flush items that are at the tail of
      the log with minimal latency. What we really need to do is trigger a
      log flush, but then not wait for it at all - we've already done our
      waiting for stuff to complete when we backed off prior to the log
      force being issued.
      
      Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
      still do a blocking flush of the CIL and that is what is causing the
      issue. Hence we need a new interface for the CIL to trigger an
      immediate background push of the CIL to get it moving faster but not
      to wait on that to occur. While the CIL is pushing, the AIL can also
      be pushing.
      
      We already have an internal interface to do this -
      xlog_cil_push_now() - but we need a wrapper for it to be used
      externally. xlog_cil_force_seq() can easily be extended to do what
      we need as it already implements the synchronous CIL push via
      xlog_cil_push_now(). Add the necessary flags and "push current
      sequence" semantics to xlog_cil_force_seq() and convert the AIL
      pushing to use it.
      
      One of the complexities here is that the CIL push does not guarantee
      that the commit record for the CIL checkpoint is written to disk.
      The current log force ensures this by submitting the current ACTIVE
      iclog that the commit record was written to. We need the CIL to
      actually write this commit record to disk for an async push to
      ensure that the checkpoint actually makes it to disk and unpins the
      pinned items in the checkpoint on completion. Hence we need to pass
      down to the CIL push that we are doing an async flush so that it can
      switch out the commit_iclog if necessary to get written to disk when
      the commit iclog is finally released.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      0020a190
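The "wanted/got" annotations in the trace in this commit come from subtracting the timestamp of one pass's final `last lsn` line from the timestamp of the next pass's `target` line. A minimal sketch of that arithmetic, using the first pair of timestamps from the trace; the helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a "seconds.microseconds" trace timestamp into microseconds. */
static int64_t ts_to_us(int64_t sec, int64_t usec)
{
	return sec * 1000000 + usec;
}

/* Whole milliseconds between the end of one pass and the start of the next. */
static int64_t pass_delay_ms(int64_t end_us, int64_t next_start_us)
{
	return (next_start_us - end_us) / 1000;
}

/* How many times longer than the requested timeout the delay was. */
static int64_t overshoot_factor(int64_t got_ms, int64_t wanted_ms)
{
	return got_ms / wanted_ms;
}
```

For the first pass, `pass_delay_ms(ts_to_us(399, 829437), ts_to_us(400, 99622))` gives 270 ms against a requested timeout of 20 ms, which is the sleep time plus the time spent blocked in the synchronous log force.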
    • D
      xfs: order CIL checkpoint start records · 68a74dca
      Committed by Dave Chinner
      Because log recovery depends on strictly ordered start records as
      well as strictly ordered commit records.
      
      This is a zero day bug in the way XFS writes pipelined transactions
      to the journal which is exposed by fixing the zero day bug that
      prevents the CIL from pipelining checkpoints. This re-introduces
      explicit concurrent commits back into the on-disk journal and hence
      out of order start records.
      
      The XFS journal commit code has never ordered start records and we
      have relied on strict commit record ordering for correct recovery
      ordering of concurrently written transactions. Unfortunately, root
      cause analysis uncovered the fact that log recovery uses the LSN of
      the start record for transaction commit processing. Hence, whilst
      the commits are processed in strict order by recovery, the LSNs
      associated with the commits can be out of order and so recovery may
      stamp incorrect LSNs into objects and/or misorder intents in the AIL
      for later processing. This can result in log recovery failures
      and/or on disk corruption, sometimes silent.
      
      Because this is a long standing log recovery issue, we can't just
      fix log recovery and call it good. This still leaves older kernels
      susceptible to recovery failures and corruption when replaying a log
      from a kernel that pipelines checkpoints. There is also the issue
      that in-memory ordering for AIL pushing and data integrity
      operations are based on checkpoint start LSNs, and if the start LSN
      is incorrect in the journal, it is also incorrect in memory.
      
      Hence there's really only one choice for fixing this zero-day bug:
      we need to strictly order checkpoint start records in ascending
      sequence order in the log, the same way we already strictly order
      commit records.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      68a74dca
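The ordering rule for start records can be sketched as a simple predicate: a checkpoint may write its start record only once every checkpoint with a lower sequence number has written its own. The structure and function below are hypothetical simplifications for illustration, not the kernel's data structures, and they leave out the sleeping and wakeups a real implementation needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a CIL checkpoint context. */
struct ckpt_ctx {
	uint64_t sequence;	/* checkpoint sequence number */
	uint64_t start_lsn;	/* 0 until the start record has been written */
};

/*
 * Return true if the checkpoint with the given sequence has to wait,
 * i.e. some checkpoint with a lower sequence number has not yet
 * written its start record.
 */
static bool start_record_must_wait(const struct ckpt_ctx *committing,
				   size_t count, uint64_t sequence)
{
	for (size_t i = 0; i < count; i++) {
		if (committing[i].sequence < sequence &&
		    committing[i].start_lsn == 0)
			return true;
	}
	return false;
}
```

With checkpoints 1, 2 and 3 in flight and only checkpoint 1 having a start LSN, checkpoint 2 may proceed while checkpoint 3 must wait until checkpoint 2 records its start LSN.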
    • D
      xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state() · caa80090
      Committed by Dave Chinner
      Now that we have a mechanism to guarantee that the callbacks
      attached to an iclog are owned by the context that attaches them
      until they drop their reference to the iclog via
      xlog_state_release_iclog(), we can attach callbacks to the iclog at
      any time we have an active reference to the iclog.
      
      xlog_state_get_iclog_space() always guarantees that the commit
      record will fit in the iclog it returns, so we can move this IO
      callback setting to xlog_cil_set_ctx_write_state(), record the
      commit iclog in the context and remove the need for the commit iclog
      to be returned by xlog_write() altogether.
      
      This, in turn, allows us to move the wakeup for ordered commit
      record writes up into xlog_cil_set_ctx_write_state(), too, because
      we have been guaranteed that this commit record will be physically
      located in the iclog before any waiting commit record at a higher
      sequence number will be granted iclog space.
      
      This further cleans up the post commit record write processing in
      the CIL push code, especially as xlog_state_release_iclog() will now
      clean up the context when shutdown errors occur.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      caa80090
    • D
      xfs: factor out log write ordering from xlog_cil_push_work() · bf034bc8
      Committed by Dave Chinner
      So we can use it for start record ordering as well as commit record
      ordering in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      bf034bc8
    • D
      xfs: pass a CIL context to xlog_write() · c45aba40
      Committed by Dave Chinner
      Pass the CIL context to xlog_write() rather than a pointer to a LSN
      variable. Only the CIL checkpoint calls to xlog_write() need to know
      about the start LSN of the writes, so rework xlog_write to directly
      write the LSNs into the CIL context structure.
      
      This removes the commit_lsn variable from xlog_cil_push_work(), so
      now we only have to issue the commit record ordering wakeup from
      there.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      c45aba40
    • D
      xfs: move xlog_commit_record to xfs_log_cil.c · 2ce82b72
      Committed by Dave Chinner
      It is only used by the CIL checkpoints, and is the counterpart to
      start record formatting and writing that is already local to
      xfs_log_cil.c.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      2ce82b72
    • D
      xfs: log head and tail aren't reliable during shutdown · 2562c322
      Committed by Dave Chinner
      I'm seeing assert failures from xlog_space_left() after a shutdown
      has begun that look like:
      
      XFS (dm-0): log I/O error -5
      XFS (dm-0): xfs_do_force_shutdown(0x2) called from line 1338 of file fs/xfs/xfs_log.c. Return address = xlog_ioend_work+0x64/0xc0
      XFS (dm-0): Log I/O Error Detected.
      XFS (dm-0): Shutting down filesystem. Please unmount the filesystem and rectify the problem(s)
      XFS (dm-0): xlog_space_left: head behind tail
      XFS (dm-0):   tail_cycle = 6, tail_bytes = 2706944
      XFS (dm-0):   GH   cycle = 6, GH   bytes = 1633867
      XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 1310
      ------------[ cut here ]------------
      Call Trace:
       xlog_space_left+0xc3/0x110
       xlog_grant_push_threshold+0x3f/0xf0
       xlog_grant_push_ail+0x12/0x40
       xfs_log_reserve+0xd2/0x270
       ? __might_sleep+0x4b/0x80
       xfs_trans_reserve+0x18b/0x260
      .....
      
      There are two things here. Firstly, after a shutdown, the log head
      and tail can be out of whack as things abort and release (or don't
      release) resources, so checking them for sanity doesn't make much
      sense. Secondly, xfs_log_reserve() can race with shutdown and so it
      can still fail like this even though it has already checked for a
      log shutdown before calling xlog_grant_push_ail().
      
      So, before ASSERT failing in xlog_space_left(), make sure we haven't
      already shut down....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      2562c322
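The "head behind tail" condition in the log output of this commit can be reproduced with a simplified model of the space calculation. This is a sketch under assumptions: the structure, field and function names are hypothetical, the log size used in any call is made up, and the real `xlog_space_left()` may differ in detail. The only behaviour taken from the message is that the inconsistency should not trip an assert once the log has shut down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the log for this sketch. */
struct log_model {
	uint32_t size;		/* log size in bytes */
	bool	 shutdown;	/* log has already been shut down */
};

struct space_result {
	uint32_t free_bytes;
	bool	 assert_worthy;	/* inconsistency seen on a live log */
};

static struct space_result space_left(const struct log_model *log,
				      uint32_t tail_cycle, uint32_t tail_bytes,
				      uint32_t head_cycle, uint32_t head_bytes)
{
	struct space_result r = { 0, false };

	if (tail_cycle == head_cycle && head_bytes >= tail_bytes) {
		/* Head and tail in the same cycle, head ahead of tail. */
		r.free_bytes = log->size - (head_bytes - tail_bytes);
	} else if (head_cycle == tail_cycle + 1) {
		/* Head has wrapped once; free space is the gap up to the tail. */
		r.free_bytes = tail_bytes > head_bytes ?
				tail_bytes - head_bytes : 0;
	} else if (head_cycle > tail_cycle + 1) {
		/* Head is more than a full cycle ahead: no space left. */
		r.free_bytes = 0;
	} else {
		/* Head behind tail: only treat as a bug if the log is live. */
		r.free_bytes = log->size;
		r.assert_worthy = !log->shutdown;
	}
	return r;
}
```

Feeding in the values from the log output (tail cycle 6, tail bytes 2706944, head cycle 6, head bytes 1633867) lands in the last branch, which flags the inconsistency on a live log and stays quiet after shutdown.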
    • D
      xfs: don't run shutdown callbacks on active iclogs · 502a01fa
      Committed by Dave Chinner
      When the log is shutdown, it currently walks all the iclogs and runs
      callbacks that are attached to the iclogs, regardless of whether the
      iclog is queued for IO completion or not. This creates a problem for
      contexts attaching callbacks to iclogs in that a racing shutdown can
      run the callbacks even before the attaching context has finished
      processing the iclog and releasing it for IO submission.
      
      If the callback processing of the iclog frees the structure that is
      attached to the iclog, then this leads to a UAF scenario that can
      only be protected against by holding the icloglock from the point
      callbacks are attached through to the release of the iclog. While we
      currently do this, it is not practical or sustainable.
      
      Hence we need to make shutdown processing the responsibility of the
      context that holds active references to the iclog. We know that the
      contexts attaching callbacks to the iclog must have active
      references to the iclog, and that means they must be in either
      ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over
      iclogs in these states -except- when the log is shut down.
      
      xlog_state_do_callback() checks the state of the iclogs while
      holding the icloglock, therefore the reference count/state change
      that occurs in xlog_state_release_iclog() after the callbacks are
      attached is atomic w.r.t. shutdown processing.
      
      We can't push the responsibility of callback cleanup onto the CIL
      context because we can have ACTIVE iclogs that have callbacks
      attached that have already been released. Hence we really need to
      internalise the cleanup of callbacks into xlog_state_release_iclog()
      processing.
      
      Indeed, we already have that internalisation via:
      
      xlog_state_release_iclog
        drop last reference
          ->SYNCING
        xlog_sync
          xlog_write_iclog
            if (log_is_shutdown)
              xlog_state_done_syncing()
                xlog_state_do_callback()
                  <process shutdown on iclog that is now in SYNCING state>
      
      The problem is that xlog_state_release_iclog() aborts before doing
      anything if the log is already shut down. It assumes that the
      callbacks have already been cleaned up, and it doesn't need to do
      any cleanup.
      
      Hence the fix is to remove the xlog_is_shutdown() check from
      xlog_state_release_iclog() so that reference counts are correctly
      released from the iclogs, and when the reference count is zero we
      always transition to SYNCING if the log is shut down. Hence we'll
      always enter the xlog_sync() path in a shutdown and eventually end
      up erroring out the iclog IO and running xlog_state_do_callback() to
      process the callbacks attached to the iclog.
      
      This allows us to stop processing referenced ACTIVE/WANT_SYNC iclogs
      directly in the shutdown code, and in doing so gets rid of the UAF
      vector that currently exists. This then decouples the adding of
      callbacks to the iclogs from xlog_state_release_iclog() as we
      guarantee that xlog_state_release_iclog() will process the callbacks
      if the log has been shut down before xlog_state_release_iclog() has
      been called.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      502a01fa
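The release path described in this commit can be modelled as a small state machine: dropping the last reference on an iclog after a shutdown always moves it to SYNCING, and the callbacks are then processed as part of erroring out the IO. This is a sketch only; the enum, structure and function names are hypothetical, locking is omitted, and the callback list is reduced to two counters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of iclog states used by this sketch. */
enum iclog_state { ICLOG_ACTIVE, ICLOG_WANT_SYNC, ICLOG_SYNCING };

struct iclog_model {
	enum iclog_state state;
	int refcount;		/* active references held by contexts */
	int callbacks_pending;	/* callbacks attached but not yet run */
	int callbacks_run;	/* callbacks that have been processed */
};

/* Stand-in for callback processing on an iclog. */
static void run_callbacks(struct iclog_model *ic)
{
	ic->callbacks_run += ic->callbacks_pending;
	ic->callbacks_pending = 0;
}

/* Drop one reference; the last reference decides what happens next. */
static void release_iclog(struct iclog_model *ic, bool log_shutdown)
{
	assert(ic->refcount > 0);
	if (--ic->refcount > 0)
		return;		/* other contexts still hold references */

	if (log_shutdown) {
		/* Always go through the sync path so callbacks get cleaned up. */
		ic->state = ICLOG_SYNCING;
		run_callbacks(ic);
		return;
	}
	if (ic->state == ICLOG_WANT_SYNC)
		ic->state = ICLOG_SYNCING;	/* callbacks run at IO completion */
}
```

The point of the model is that no callback runs while a context still holds a reference, which is what removes the use-after-free window the message describes.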
    • D
      xfs: separate out log shutdown callback processing · aad7272a
      Committed by Dave Chinner
      The iclog callback processing done during a forced log shutdown has
      different logic to normal runtime IO completion callback processing.
      Separate out the shutdown callbacks into their own function and call
      that from the shutdown code instead.
      
      We don't need this shutdown specific logic in the normal runtime
      completion code - we'll always run the shutdown version on shutdown,
      and it will do what shutdown needs regardless of whether there are
      racing IO completion callbacks scheduled or in progress. Hence we
      can also simplify the normal IO completion callpath and only abort
      if shutdown occurred while we were actively processing callbacks.
      
      Further, separating out the IO completion logic from the shutdown
      logic avoids callback race conditions from being triggered by log IO
      completion after a shutdown. IO completion will now only run
      callbacks on iclogs that are in the correct state for a callback to
      be run, avoiding the possibility of running callbacks on a
      referenced iclog that hasn't yet been submitted for IO.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      aad7272a
    • D
      xfs: rework xlog_state_do_callback() · 8bb92005
      Committed by Dave Chinner
      Clean it up a bit by factoring and rearranging some of the code.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      8bb92005
    • D
      xfs: make forced shutdown processing atomic · b36d4651
      Committed by Dave Chinner
      The running of a forced shutdown is a bit of a mess. It does racy
      checks for XFS_MOUNT_SHUTDOWN in xfs_do_force_shutdown(), then
      does more racy checks in xfs_log_force_unmount() before finally
      setting XFS_MOUNT_SHUTDOWN and XLOG_IO_ERROR under the
      log->icloglock.
      
      Move the checking and setting of XFS_MOUNT_SHUTDOWN into
      xfs_do_force_shutdown() so we only process a shutdown once and once
      only. Serialise this with the mp->m_sb_lock spinlock so that the
      state change is atomic and won't race. Move all the mount specific
      shutdown state changes from xfs_log_force_unmount() to
      xfs_do_force_shutdown() so they are done atomically with setting
      XFS_MOUNT_SHUTDOWN.
      
      Then get rid of the racy xlog_is_shutdown() check from
      xlog_force_shutdown(), and gate the log shutdown on the
      test_and_set_bit(XLOG_IO_ERROR) test under the icloglock. This
      means that the log is shutdown once and once only, and code that
      needs to prevent races with shutdown can do so by holding the
      icloglock and checking the return value of xlog_is_shutdown().
      
      This results in a predictable shutdown execution process - we set the
      shutdown flags once and process the shutdown once rather than the
      current "as many concurrent shutdowns as can race to the flag
      setting" situation we have now.
      
      Also, now that shutdown is atomic, always emit a stack trace when the
      error level for the filesystem is high enough. This means that we
      always get a stack trace when trying to diagnose the cause of
      shutdowns in the field, rather than just for SHUTDOWN_CORRUPT_INCORE
      cases.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      b36d4651
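The "once and once only" gating in this commit relies on an atomic test-and-set: whoever flips the bit first processes the shutdown, and everyone else backs off. A minimal userspace sketch using C11 atomics as a stand-in for the kernel's `test_and_set_bit()`; the structure, the bit value and the function names are hypothetical, and the icloglock that the message mentions is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit value for this sketch, not the kernel's definition. */
#define LOG_IO_ERROR_BIT 0x1u

/* Hypothetical, simplified log state. */
struct log_state {
	atomic_uint opstate;		/* operational state bits */
	int shutdowns_processed;	/* how many times shutdown work ran */
};

static void log_state_init(struct log_state *log)
{
	atomic_init(&log->opstate, 0u);
	log->shutdowns_processed = 0;
}

/* Atomically set the bit and report whether it was already set. */
static bool test_and_set_flag(atomic_uint *word, unsigned int bit)
{
	return (atomic_fetch_or(word, bit) & bit) != 0;
}

/* Returns true only for the caller that actually performed the shutdown. */
static bool force_shutdown(struct log_state *log)
{
	if (test_and_set_flag(&log->opstate, LOG_IO_ERROR_BIT))
		return false;	/* someone else already shut the log down */
	log->shutdowns_processed++;
	return true;
}
```

Calling `force_shutdown()` any number of times on the same `log_state` processes the shutdown exactly once, which is the predictable behaviour the message is after.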
    • D
      xfs: convert log flags to an operational state field · e1d06e5f
      Committed by Dave Chinner
      log->l_flags doesn't actually contain "flags" as such, it contains
      operational state information that can change at runtime. For the
      shutdown state, this at least should be an atomic bit because
      it is read without holding locks in many places and so using atomic
      bitops for the state field modifications makes sense.
      
      This allows us to use things like test_and_set_bit() on state
      changes (e.g. setting XLOG_TAIL_WARN) to avoid races in setting the
      state when we aren't holding locks.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      e1d06e5f
    • D
      xfs: move recovery needed state updates to xfs_log_mount_finish · fd67d8a0
      Committed by Dave Chinner
      xfs_log_mount_finish() needs to know if recovery is needed or not to
      make decisions on whether to flush the log and AIL.  Move the
      handling of the NEED_RECOVERY state out to this function rather than
      needing a temporary variable to store this state over the call to
      xlog_recover_finish().
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      fd67d8a0