1. 17 Aug 2021, 9 commits
    • xfs: AIL needs asynchronous CIL forcing · 0020a190
      Committed by Dave Chinner
      The AIL pushing is stalling on log forces when it comes across
      pinned items. This is happening on removal workloads where the AIL
      is dominated by stale items that are removed from the AIL when the
      checkpoint that marks the items stale is committed to the journal.
      This results in relatively few items in the AIL, but those that
      remain are often pinned, as the directories that items are being
      removed from are still being logged.
      
      As a result, many push cycles through the CIL will first issue a
      blocking log force to unpin the items. This can take some time to
      complete, with tracing regularly showing push delays of half a
      second and sometimes up into the range of several seconds. Sequences
      like this aren't uncommon:
      
      ....
       399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 270ms delay>
       400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
       400.099623:  xfsaild: first lsn 0x11002f3600
       400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
      <wanted 50ms, got 500ms delay>
       400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
       400.589349:  xfsaild: first lsn 0x1100305000
       400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
      <wanted 50ms, got 460ms delay>
       400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
       400.950343:  xfsaild: first lsn 0x1100317c00
       400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 200ms delay>
       401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
       401.142334:  xfsaild: first lsn 0x110032e600
       401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
      <wanted 10ms, got 10ms delay>
       401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
       401.154328:  xfsaild: first lsn 0x1100353000
       401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 300ms delay>
       401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
       401.451526:  xfsaild: first lsn 0x1100353000
       401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
      <wanted 50ms, got 500ms delay>
       401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
      ....
      
      In each of these cases, every AIL pass saw 101 log items stuck on
      the AIL (pinned), with very few other items being found. On each
      pass a log force was issued, and the delay between last/first is
      the sleep time plus the sync log force time.
      
      Some of these 101 items pinned the tail of the log. The tail of the
      log does slowly creep forward (first lsn), but the problem is that
      the log is actually out of reservation space because it has been
      running so many transactions that create stale items which never
      reach the AIL but still consume log space. Hence we have a largely
      empty AIL, with long term pins on items that pin the tail of the
      log and don't get pushed frequently enough to keep log space
      available.
      
      The problem is the hundreds of milliseconds that we block in the log
      force pushing the CIL out to disk. The AIL should not be stalled
      like this - it needs to run and flush items that are at the tail of
      the log with minimal latency. What we really need to do is trigger a
      log flush, but then not wait for it at all - we've already done our
      waiting for stuff to complete when we backed off prior to the log
      force being issued.
      
      Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
      still do a blocking flush of the CIL and that is what is causing the
      issue. Hence we need a new interface for the CIL to trigger an
      immediate background push of the CIL to get it moving faster but not
      to wait on that to occur. While the CIL is pushing, the AIL can also
      be pushing.
      
      We already have an internal interface to do this -
      xlog_cil_push_now() - but we need a wrapper for it to be used
      externally. xlog_cil_force_seq() can easily be extended to do what
      we need as it already implements the synchronous CIL push via
      xlog_cil_push_now(). Add the necessary flags and "push current
      sequence" semantics to xlog_cil_force_seq() and convert the AIL
      pushing to use it.
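
      Illustratively, the external wrapper ends up with a shape like
      this (a simplified sketch based on the description above, not the
      verbatim patch):

      void
      xlog_cil_flush(
              struct xlog     *log)
      {
              xfs_csn_t       seq = log->l_cilp->xc_current_sequence;

              /*
               * Start the checkpoint push and commit record IO for the
               * current sequence, but do not wait on completion - the
               * AIL keeps pushing while the CIL writeback is in flight.
               */
              xlog_cil_push_now(log, seq, true);      /* true == async */
      }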
      
      One of the complexities here is that the CIL push does not guarantee
      that the commit record for the CIL checkpoint is written to disk.
      The current log force ensures this by submitting the current ACTIVE
      iclog that the commit record was written to. We need the CIL to
      actually write this commit record to disk for an async push to
      ensure that the checkpoint actually makes it to disk and unpins the
      pinned items in the checkpoint on completion. Hence we need to pass
      down to the CIL push that we are doing an async flush, so that it
      can switch out the commit_iclog if necessary and have it written
      to disk when the commit iclog is finally released.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: order CIL checkpoint start records · 68a74dca
      Committed by Dave Chinner
      Log recovery depends on strictly ordered start records as well as
      strictly ordered commit records.
      
      This is a zero-day bug in the way XFS writes pipelined transactions
      to the journal, exposed by fixing the zero-day bug that prevents
      the CIL from pipelining checkpoints. That fix re-introduces
      explicit concurrent commits back into the on-disk journal and hence
      out of order start records.
      
      The XFS journal commit code has never ordered start records and we
      have relied on strict commit record ordering for correct recovery
      ordering of concurrently written transactions. Unfortunately, root
      cause analysis uncovered the fact that log recovery uses the LSN of
      the start record for transaction commit processing. Hence, whilst
      the commits are processed in strict order by recovery, the LSNs
      associated with the commits can be out of order and so recovery may
      stamp incorrect LSNs into objects and/or misorder intents in the AIL
      for later processing. This can result in log recovery failures
      and/or on disk corruption, sometimes silent.
      
      Because this is a long standing log recovery issue, we can't just
      fix log recovery and call it good. This still leaves older kernels
      susceptible to recovery failures and corruption when replaying a log
      from a kernel that pipelines checkpoints. There is also the issue
      that in-memory ordering for AIL pushing and data integrity
      operations is based on checkpoint start LSNs, and if the start LSN
      is incorrect in the journal, it is also incorrect in memory.
      
      Hence there's really only one choice for fixing this zero-day bug:
      we need to strictly order checkpoint start records in ascending
      sequence order in the log, the same way we already strictly order
      commit records.
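
      A sketch of the resulting start record ordering loop in the push
      path (field names such as start_lsn and start_wait are modelled on
      the existing commit record ordering code, so treat the details as
      illustrative):

      restart:
              spin_lock(&cil->xc_push_lock);
              list_for_each_entry(new_ctx, &cil->xc_committing, committing) {
                      /* only order against older (lower sequence) contexts */
                      if (new_ctx->sequence >= ctx->sequence)
                              continue;
                      if (!new_ctx->start_lsn) {
                              /* xlog_wait() sleeps and drops xc_push_lock */
                              xlog_wait(&new_ctx->start_wait,
                                        &cil->xc_push_lock);
                              goto restart;
                      }
              }
              spin_unlock(&cil->xc_push_lock);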
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: attach iclog callbacks in xlog_cil_set_ctx_write_state() · caa80090
      Committed by Dave Chinner
      Now that we have a mechanism to guarantee that the callbacks
      attached to an iclog are owned by the context that attaches them
      until they drop their reference to the iclog via
      xlog_state_release_iclog(), we can attach callbacks to the iclog at
      any time we have an active reference to the iclog.
      
      xlog_state_get_iclog_space() always guarantees that the commit
      record will fit in the iclog it returns, so we can move this IO
      callback setting to xlog_cil_set_ctx_write_state(), record the
      commit iclog in the context and remove the need for the commit iclog
      to be returned by xlog_write() altogether.
      
      This, in turn, allows us to move the wakeup for ordered commit
      record writes up into xlog_cil_set_ctx_write_state(), too, because
      we have been guaranteed that this commit record will be physically
      located in the iclog before any waiting commit record at a higher
      sequence number will be granted iclog space.
      
      This further cleans up the post commit record write processing in
      the CIL push code, especially as xlog_state_release_iclog() will now
      clean up the context when shutdown errors occur.
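
      The resulting function looks roughly like this (a simplified
      sketch of the description above, not the verbatim patch):

      static void
      xlog_cil_set_ctx_write_state(
              struct xfs_cil_ctx      *ctx,
              struct xlog_in_core     *iclog)
      {
              struct xfs_cil          *cil = ctx->cil;
              struct xlog             *log = cil->xc_log;

              /* hold the iclog until the context releases it for IO */
              atomic_inc(&iclog->ic_refcnt);

              /* safe: the commit record is guaranteed to fit in here */
              spin_lock(&log->l_icloglock);
              list_add_tail(&ctx->iclog_entry, &iclog->ic_callbacks);
              spin_unlock(&log->l_icloglock);

              /* record the commit iclog/LSN, wake ordered waiters */
              spin_lock(&cil->xc_push_lock);
              ctx->commit_iclog = iclog;
              ctx->commit_lsn = be64_to_cpu(iclog->ic_header.h_lsn);
              wake_up_all(&cil->xc_commit_wait);
              spin_unlock(&cil->xc_push_lock);
      }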
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: factor out log write ordering from xlog_cil_push_work() · bf034bc8
      Committed by Dave Chinner
      This factors the log write ordering code out of
      xlog_cil_push_work() so we can use it for start record ordering as
      well as commit record ordering in future.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: pass a CIL context to xlog_write() · c45aba40
      Committed by Dave Chinner
      Pass the CIL context to xlog_write() rather than a pointer to a LSN
      variable. Only the CIL checkpoint calls to xlog_write() need to know
      about the start LSN of the writes, so rework xlog_write to directly
      write the LSNs into the CIL context structure.
      
      This removes the commit_lsn variable from xlog_cil_push_work(), so
      now we only have to issue the commit record ordering wakeup from
      there.
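
      Illustratively, the interface changes along these lines (the
      parameter lists are abbreviated, so treat this as a sketch):

      /*
       * before: int xlog_write(log, log_vector, ticket, *start_lsn,
       *                        **commit_iclog, optype);
       * after:  int xlog_write(log, ctx, log_vector, ticket,
       *                        **commit_iclog, optype);
       *
       * Inside xlog_write(), the first iclog written for a CIL context
       * records its LSN directly into the context:
       */
      if (ctx)
              xlog_cil_set_ctx_write_state(ctx, iclog);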
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: move xlog_commit_record to xfs_log_cil.c · 2ce82b72
      Committed by Dave Chinner
      xlog_commit_record() is only used by the CIL checkpoints, and it is
      the counterpart to the start record formatting and writing that is
      already local to xfs_log_cil.c.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: don't run shutdown callbacks on active iclogs · 502a01fa
      Committed by Dave Chinner
      When the log is shutdown, it currently walks all the iclogs and runs
      callbacks that are attached to the iclogs, regardless of whether the
      iclog is queued for IO completion or not. This creates a problem for
      contexts attaching callbacks to iclogs in that a racing shutdown can
      run the callbacks even before the attaching context has finished
      processing the iclog and releasing it for IO submission.
      
      If the callback processing of the iclog frees the structure that is
      attached to the iclog, then this leads to an UAF scenario that can
      only be protected against by holding the icloglock from the point
      callbacks are attached through to the release of the iclog. While we
      currently do this, it is not practical or sustainable.
      
      Hence we need to make shutdown processing the responsibility of the
      context that holds active references to the iclog. We know that the
      contexts attaching callbacks to the iclog must have active
      references to the iclog, and that means they must be in either
      ACTIVE or WANT_SYNC states. xlog_state_do_callback() will skip over
      iclogs in these states -except- when the log is shut down.
      
      xlog_state_do_callback() checks the state of the iclogs while
      holding the icloglock, therefore the reference count/state change
      that occurs in xlog_state_release_iclog() after the callbacks are
      attached is atomic w.r.t. shutdown processing.
      
      We can't push the responsibility of callback cleanup onto the CIL
      context because we can have ACTIVE iclogs that have callbacks
      attached that have already been released. Hence we really need to
      internalise the cleanup of callbacks into xlog_state_release_iclog()
      processing.
      
      Indeed, we already have that internalisation via:
      
      xlog_state_release_iclog
        drop last reference
          ->SYNCING
        xlog_sync
          xlog_write_iclog
            if (log_is_shutdown)
              xlog_state_done_syncing()
                xlog_state_do_callback()
                  <process shutdown on iclog that is now in SYNCING state>
      
      The problem is that xlog_state_release_iclog() aborts before doing
      anything if the log is already shut down. It assumes that the
      callbacks have already been cleaned up, and it doesn't need to do
      any cleanup.
      
      Hence the fix is to remove the xlog_is_shutdown() check from
      xlog_state_release_iclog() so that reference counts are correctly
      released from the iclogs, and when the reference count is zero we
      always transition to SYNCING if the log is shut down. Hence we'll
      always enter the xlog_sync() path in a shutdown and eventually end
      up erroring out the iclog IO and running xlog_state_do_callback() to
      process the callbacks attached to the iclog.
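
      In sketch form, the fix looks something like this (simplified; the
      real function also manages the tail LSN and the WANT_SYNC state
      transitions):

      int
      xlog_state_release_iclog(
              struct xlog             *log,
              struct xlog_in_core     *iclog)
      {
              lockdep_assert_held(&log->l_icloglock);

              /* removed: if (xlog_is_shutdown(log)) return -EIO; */

              if (!atomic_dec_and_test(&iclog->ic_refcnt))
                      return 0;

              iclog->ic_state = XLOG_STATE_SYNCING;
              spin_unlock(&log->l_icloglock);
              xlog_sync(log, iclog);  /* errors out, runs callbacks */
              spin_lock(&log->l_icloglock);
              return 0;
      }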
      
      This allows us to stop processing referenced ACTIVE/WANT_SYNC iclogs
      directly in the shutdown code, and in doing so gets rid of the UAF
      vector that currently exists. This then decouples the adding of
      callbacks to the iclogs from xlog_state_release_iclog() as we
      guarantee that xlog_state_release_iclog() will process the callbacks
      if the log has been shut down before xlog_state_release_iclog() has
      been called.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: XLOG_STATE_IOERROR must die · 5112e206
      Committed by Dave Chinner
      We don't need an iclog state field to tell us the log has been shut
      down. We can just check xlog_is_shutdown() instead. This avoids
      having shutdown overwrite the current iclog state while it is being
      actively used by the log code, and so having to ensure that every
      iclog state check handles XLOG_STATE_IOERROR appropriately.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: convert XLOG_FORCED_SHUTDOWN() to xlog_is_shutdown() · 2039a272
      Committed by Dave Chinner
      Make it less shouty and a static inline before adding more calls
      through the log code.
      
      Also convert internal log code that uses XFS_FORCED_SHUTDOWN(mount)
      to use xlog_is_shutdown(log) as well.
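
      The helper itself is tiny; at this point in the series the shutdown
      state still lives in the log flags, so it is essentially:

      static inline bool
      xlog_is_shutdown(struct xlog *log)
      {
              return log->l_flags & XLOG_IO_ERROR;
      }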
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  2. 10 Aug 2021, 1 commit
  3. 30 Jul 2021, 1 commit
    • xfs: fix ordering violation between cache flushes and tail updates · 0dc8f7f1
      Committed by Dave Chinner
      There is a race between the new CIL async data device metadata IO
      completion cache flush and the log tail in the iclog the flush
      covers being updated. This can be seen by repeating generic/482 in a
      loop, and eventually log recovery fails with failures such as this:
      
      XFS (dm-3): Starting recovery (logdev: internal)
      XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
      XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
      XFS (dm-3): Unmount and run xfs_repair
      XFS (dm-3): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
      
      Analysis of the logwrite replay shows that there were no writes to
      the data device between the FUA at write 124 and the FUA at write
      125, but log recovery at write 125 failed. The difference was that
      the one log write at 125 moved the tail of the log forwards from
      (1,8) to (1,32), so the inode create intent in (1,8) was not
      replayed and the inode cluster was zero on disk when replay of the
      first inode item in (1,32) was attempted.
      
      This meant that the journal write at 125 did not ensure that
      metadata which had completed before the iclog was written was
      actually on stable storage. The tail of the log moved forward, so
      IO must have completed between the two iclog writes. This means
      there is a race condition between the unconditional async cache
      flush in the CIL push work and the tail LSN that is written to the
      iclog. It happens like so:
      
      CIL push work				AIL push work
      -------------				-------------
      Add to committing list
      start async data dev cache flush
      .....
      <flush completes>
      <all writes to old tail lsn are stable>
      xlog_write
        ....					push inode create buffer
      					<start IO>
      					.....
      xlog_write(commit record)
        ....					<IO completes>
        					log tail moves
        					  xlog_assign_tail_lsn()
      start_lsn == commit_lsn
        <no iclog preflush!>
      xlog_state_release_iclog
        __xlog_state_release_iclog()
          <writes *new* tail_lsn into iclog>
        xlog_sync()
          ....
          submit_bio()
      <tail in log moves forward without flushing written metadata>
      
      Essentially, this can only occur if the commit iclog is issued
      without a cache flush. If the iclog bio is submitted with
      REQ_PREFLUSH, then it will guarantee that all the completed IO is
      on stable storage before the iclog bio with the new tail LSN in it
      is written to the log.
      
      IOWs, the tail lsn that is written to the iclog needs to be sampled
      *before* we issue the cache flush that guarantees all IO up to that
      LSN has been completed.
      
      To fix this without giving up the performance advantage of the
      flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
      compared to 5.13), we need to ensure that we always issue a cache
      flush if the tail LSN changes between the initial async flush and
      the commit record being written. This requires sampling the
      tail_lsn before we start the flush, and then passing the sampled
      tail LSN to xlog_state_release_iclog() so it can determine whether
      the tail LSN has changed while writing the checkpoint. If the tail
      LSN has changed, it sets the NEED_FLUSH flag on the iclog and we
      issue another cache flush before writing the iclog.
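
      In sketch form (simplified; the flag name follows the earlier cache
      flush reduction patches):

      xfs_lsn_t       preflush_tail_lsn;

      /* sample the tail before starting the async cache flush */
      preflush_tail_lsn = atomic64_read(&log->l_tail_lsn);

      /* ...async data device cache flush, CIL formatting, xlog_write()... */

      /* at release time: did the tail move while we were writing? */
      if (atomic64_read(&log->l_tail_lsn) != preflush_tail_lsn)
              iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; /* REQ_PREFLUSH */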
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  4. 26 Jun 2021, 2 commits
    • xfs: don't wait on future iclogs when pushing the CIL · 1effb72a
      Committed by Dave Chinner
      The iclogbuf ring attached to the struct xlog is circular, hence the
      first and last iclogs in the ring can only be determined by
      comparing them against the log->l_iclog pointer.
      
      In xlog_cil_push_work(), we want to wait on previous iclogs that were
      issued so that we can flush them to stable storage with the commit
      record write, and it simply waits on the previous iclog in the ring.
      This, however, leads to CIL push hangs in generic/019 like so:
      
      task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
      Workqueue: xfs-cil/pmem1 xlog_cil_push_work
      Call Trace:
       __schedule+0x30b/0x9f0
       schedule+0x68/0xe0
       xlog_wait_on_iclog+0x121/0x190
       ? wake_up_q+0xa0/0xa0
       xlog_cil_push_work+0x994/0xa10
       ? _raw_spin_lock+0x15/0x20
       ? xfs_swap_extents+0x920/0x920
       process_one_work+0x1ab/0x390
       worker_thread+0x56/0x3d0
       ? rescuer_thread+0x3c0/0x3c0
       kthread+0x14d/0x170
       ? __kthread_bind_mask+0x70/0x70
       ret_from_fork+0x1f/0x30
      
      With other threads blocking in either xlog_state_get_iclog_space()
      waiting for iclog space or xlog_grant_head_wait() waiting for log
      reservation space.
      
      The problem here is that the previous iclog on the ring might
      actually be a future iclog. That is, if log->l_iclog points at
      commit_iclog, commit_iclog is the first (oldest) iclog in the ring
      and there are no previous iclogs pending as they have all completed
      their IO and been activated again. IOWs, commit_iclog->ic_prev
      points to an iclog that will be written in the future, not one that
      has been written in the past.
      
      Hence, in this case, waiting on the ->ic_prev iclog is incorrect
      behaviour, and depending on the state of the future iclog, we can
      end up with a circular ABA wait cycle and we hang.
      
      The fix is made more complex by the fact that many iclog states
      cannot be used to determine if the iclog is a past or future iclog.
      Hence we have to determine past iclogs by checking the LSN of the
      iclog rather than their state. A past ACTIVE iclog will have an LSN
      of zero, while a future ACTIVE iclog will have an LSN greater than
      the current iclog. We don't wait on either of these cases.
      
      Similarly, a future iclog that hasn't completed IO will have an LSN
      greater than the current iclog and so we don't wait on them. A past
      iclog that is still undergoing IO completion will have a LSN less
      than the current iclog and those are the only iclogs that we need to
      wait on.
      
      Hence we can use the iclog LSN to determine what iclogs we need to
      wait on here.
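
      In sketch form, the check becomes something like this (simplified):

      xfs_lsn_t       plsn;

      spin_lock(&log->l_icloglock);
      plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn);
      if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) {
              /*
               * Only a past iclog still under IO carries a non-zero LSN
               * lower than our commit record's LSN. Wait for it so the
               * commit record's flush orders against its completion.
               * xlog_wait_on_iclog() drops l_icloglock, so retake it.
               */
              xlog_wait_on_iclog(commit_iclog->ic_prev);
              spin_lock(&log->l_icloglock);
      }
      /* ...then release the commit iclog for IO as before */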
      
      Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
      Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Fix a CIL UAF by getting rid of the iclog callback lock · a1bb8505
      Committed by Dave Chinner
      The iclog callback chain has its own lock. That was added way back
      in 2008 by myself to alleviate severe lock contention on the
      icloglock in commit 114d23aa ("[XFS] Per iclog callback chain
      lock"). This was long before delayed logging took the icloglock out
      of the hot transaction commit path and removed all contention on it.
      Hence the separate ic_callback_lock doesn't serve any scalability
      purpose anymore, and hasn't for close on a decade.
      
      Further, we only attach callbacks to iclogs in one place where we
      are already taking the icloglock soon after attaching the callbacks.
      We also have to drop the icloglock to run callbacks and grab it
      immediately afterwards again. So given that the icloglock is no
      longer hot, making it cover callbacks again doesn't really change
      the locking patterns very much at all.
      
      We also need to extend the icloglock to cover callback addition to
      fix a zero-day UAF in the CIL push code. This occurs when shutdown
      races with xlog_cil_push_work() and the shutdown runs the callbacks
      before the push releases the iclog. This results in the CIL context
      structure attached to the iclog being freed by the callback before
      the CIL push has finished referencing it, leading to UAF bugs.
      
      Hence, to avoid this UAF, we need the callback attachment to be
      atomic with post processing of the commit iclog and references to
      the structures being attached to the iclog. This requires holding
      the icloglock as that's the only way to serialise iclog state
      against a shutdown in progress.
      
      The result is we need to be using the icloglock to protect the
      callback list addition and removal and serialise them with shutdown.
      That makes the ic_callback_lock redundant and so it can be removed.
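
      In sketch form, the attach site becomes (simplified; at this point
      in the series the shutdown check is still an iclog state test):

      spin_lock(&log->l_icloglock);
      if (commit_iclog->ic_state == XLOG_STATE_IOERROR) {
              /* shutdown raced with the push: run the abort ourselves */
              spin_unlock(&log->l_icloglock);
              goto out_abort;
      }
      list_add_tail(&ctx->iclog_entry, &commit_iclog->ic_callbacks);
      spin_unlock(&log->l_icloglock);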
      
      Fixes: 71e330b5 ("xfs: Introduce delayed logging core code")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  5. 22 Jun 2021, 5 commits
    • xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Committed by Dave Chinner
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This led me to xfs_log_force_lsn() and a look at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites its "lsn"
      variable, which contained a sequence, with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
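
      Illustratively, the renamed interfaces end up with signatures along
      these lines (a sketch; xfs_csn_t is the new CIL sequence number
      type):

      /* force the CIL/log up to a CIL sequence, not an on-disk LSN */
      int             xfs_log_force_seq(struct xfs_mount *mp, xfs_csn_t seq,
                                        uint flags, int *log_flushed);
      xfs_lsn_t       xlog_cil_force_seq(struct xlog *log, xfs_csn_t seq);

      /* and in the inode log item: */
      xfs_csn_t       ili_commit_seq; /* last transaction commit sequence */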
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: Fix CIL throttle hang when CIL space used going backwards · 19f4e7cc
      Committed by Dave Chinner
      A hang with tasks stuck on the CIL hard throttle was reported and
      largely diagnosed by Donald Buczek, who discovered that it was a
      result of the CIL context space usage decrementing in committed
      transactions once the hard throttle limit had been hit and processes
      were already blocked.  This resulted in the CIL push not waking up
      those waiters because the CIL context was no longer over the hard
      throttle limit.
      
      The surprising aspect of this was the CIL space usage going
      backwards regularly enough to trigger this situation. Assumptions
      had been made in design that the relogging process would only
      increase the size of the objects in the CIL, and so that space would
      only increase.
      
      This change and commit message fix the issue and document the
      result of an audit of the triggers that can cause the CIL space to
      go backwards, how large the backwards steps tend to be, the
      frequency in which they occur, and what the impact on the CIL
      accounting code is.
      
      Even though the CIL ctx->space_used can go backwards, it will only
      do so if the log item is already logged to the CIL and contains a
      space reservation for its entire logged state. This is tracked by
      the shadow buffer state on the log item. If the item is not
      previously logged in the CIL it has no shadow buffer nor log vector,
      and hence the entire size of the logged item copied to the log
      vector is accounted to the CIL space usage, i.e. it will always go
      up in this case.
      
      If the item has a log vector (i.e. already in the CIL) and the size
      decreases, then the existing log vector will be overwritten and the
      space usage will go down. This is the only condition where the space
      usage reduces, and it can only occur when an item is already tracked
      in the CIL. Hence we are safe from CIL space usage underruns as a
      result of log items decreasing in size when they are relogged.
      
      Typically this reduction in CIL usage occurs from metadata blocks
      being freed, such as when a btree block merge occurs or a directory
      entry/xattr entry is removed and the da-tree is reduced in size.
      This generally results in a reduction in size of around a single
      block in the CIL, but also tends to increase the number of log
      vectors because the parent and sibling nodes in the tree need to be
      updated when a btree block is removed. If a multi-level merge
      occurs, then we see reduction in size of 2+ blocks, but again the
      log vector count goes up.
      
      The other vector is inode fork size changes, which only log the
      current size of the fork and ignore the previously logged size when
      the fork is relogged. Hence if we are removing items from the inode
      fork (dir/xattr removal in shortform, extent record removal in
      extent form, etc.) the relogged size of the inode fork can decrease.
      
      No other log items can decrease in size either because they are a
      fixed size (e.g. dquots) or they cannot be relogged (e.g. relogging
      an intent actually creates a new intent log item and doesn't relog
      the old item at all.) Hence the only two vectors for CIL context
      size reduction are relogging inode forks and marking buffers active
      in the CIL as stale.
      
      Long story short: the majority of the code does the right thing and
      handles the reduction in log item size correctly, and only the CIL
      hard throttle implementation is problematic and needs fixing. This
      patch makes that fix, and also adds comments to the log item code
      paths that can result in items shrinking in size when relogged, as
      a clear reminder that this can and does happen frequently.
      
      The throttle fix is based upon the change Donald proposed, though it
      goes further to ensure that once the throttle is activated, it
      captures all tasks until the CIL push issues a wakeup, regardless of
      whether the CIL space used has gone back under the throttle
      threshold.
      
      This ensures that we prevent tasks reducing the CIL slightly under
      the throttle threshold and then making more changes that push it
      well over the throttle limit. This is achieved by checking if the
      throttle wait queue is already active as a condition of throttling.
      Hence once we start throttling, we continue to apply the throttle
      until the CIL context push wakes everything on the wait queue.
      
      We can use waitqueue_active() for the waitqueue manipulations and
      checks as they are all done under the cil->xc_push_lock. Hence the
      waitqueue has external serialisation and we can safely peek inside
      the wait queue without holding the internal waitqueue locks.
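
      The resulting throttle check looks something like this sketch
      (condensed from the description above):

      spin_lock(&cil->xc_push_lock);
      if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log) ||
          waitqueue_active(&cil->xc_push_wait)) {
              /*
               * Once anyone is waiting, everyone waits: stay throttled
               * until the push wakes the whole queue, even if space_used
               * has dipped back under the limit in the meantime.
               */
              trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
              /* xlog_wait() drops xc_push_lock for us */
              xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
              return;
      }
      spin_unlock(&cil->xc_push_lock);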
      
      Many thanks to Donald for his diagnostic and analysis work to
      isolate the cause of this hang.
      Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: journal IO cache flush reductions · eef983ff
      Committed by Dave Chinner
      Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
      guarantee the ordering requirements the journal has w.r.t. metadata
      writeback. The two ordering constraints are:
      
      1. we cannot overwrite metadata in the journal until we guarantee
      that the dirty metadata has been written back in place and is
      stable.
      
      2. we cannot write back dirty metadata until it has been written to
      the journal and guaranteed to be stable (and hence recoverable) in
      the journal.
      
      The ordering guarantees of #1 are provided by REQ_PREFLUSH. This
      causes the journal IO to issue a cache flush and wait for it to
      complete before issuing the write IO to the journal. Hence all
      completed metadata IO is guaranteed to be stable before the journal
      overwrites the old metadata.
      
      The ordering guarantees of #2 are provided by the REQ_FUA, which
      ensures the journal writes do not complete until they are on stable
      storage. Hence by the time the last journal IO in a checkpoint
      completes, we know that the entire checkpoint is on stable storage
      and we can unpin the dirty metadata and allow it to be written back.
      
      This is the mechanism by which ordering was first implemented in XFS
      way back in 2002 by commit 95d97c36e5155075ba2eb22b17562cfcc53fcf96
      ("Add support for drive write cache flushing") in the xfs-archive
      tree.
      
      A lot has changed since then, most notably we now use delayed
      logging to checkpoint the filesystem to the journal rather than
      write each individual transaction to the journal. Cache flushes on
      journal IO are necessary when individual transactions are wholly
      contained within a single iclog. However, CIL checkpoints are single
      transactions that typically span hundreds to thousands of individual
      journal writes, and so the requirements for device cache flushing
      have changed.
      
      That is, the ordering rules I state above apply to ordering of
      atomic transactions recorded in the journal, not to the journal IO
      itself. Hence we need to ensure metadata is stable before we start
      writing a new transaction to the journal (guarantee #1), and we need
      to ensure the entire transaction is stable in the journal before we
      start metadata writeback (guarantee #2).
      
      Hence we only need a REQ_PREFLUSH on the journal IO that starts a
      new journal transaction to provide #1, and it is not needed on any
      other journal IO done within the context of that journal transaction.
      
      The CIL checkpoint already issues a cache flush before it starts
      writing to the log, so we no longer need the iclog IO to issue a
      REQ_PREFLUSH for us. Hence if XLOG_START_TRANS is passed
      to xlog_write(), we no longer need to mark the first iclog in
      the log write with REQ_PREFLUSH for this case. As an added bonus,
      this ordering mechanism works for both internal and external logs,
      meaning we can remove the explicit data device cache flushes from
      the iclog write code when using external logs.
      
      Given the new ordering semantics of commit records for the CIL, we
      need iclogs containing commit records to issue a REQ_PREFLUSH. We
      also require unmount records to do this. Hence for both
      XLOG_COMMIT_TRANS and XLOG_UNMOUNT_TRANS xlog_write() calls we need
      to mark the first iclog being written with REQ_PREFLUSH.
      
      For both commit records and unmount records, we also want them
      immediately on stable storage, so we want to also mark the iclogs
      that contain these records to be marked REQ_FUA. That means if a
      record is split across multiple iclogs, they are all marked REQ_FUA
      and not just the last one so that when the transaction is completed
      all the parts of the record are on stable storage.
      
      And for external logs, unmount records need a pre-write data device
      cache flush similar to the CIL checkpoint cache pre-flush as the
      internal iclog write code does not do this implicitly anymore.
      
      As an optimisation, when the commit record lands in the same iclog
      as the journal transaction starts, we don't need to wait for
      anything and can simply use REQ_FUA to provide guarantee #2.  This
      means that for fsync() heavy workloads, the cache flush behaviour is
      completely unchanged and there is no degradation in performance as a
      result of optimising the multi-IO transaction case.
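
      The resulting flag selection at iclog submission reduces to
      something like this sketch:

      /*
       * The checkpoint start iclog no longer needs either flag (the
       * CIL flushed the caches before xlog_write()); iclogs holding
       * commit or unmount records set both NEED_FLUSH and NEED_FUA.
       */
      iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE;
      if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH)
              iclog->ic_bio.bi_opf |= REQ_PREFLUSH;
      if (iclog->ic_flags & XLOG_ICL_NEED_FUA)
              iclog->ic_bio.bi_opf |= REQ_FUA;
      iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA);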
      
      The most notable sign that there is less IO latency on my test
      machine (nvme SSDs) is that the "noiclogs" rate has dropped
      substantially. This metric indicates that the CIL push is blocking
      in xlog_get_iclog_space() waiting for iclog IO completion to occur.
      With 8 iclogs of 256kB, the rate is approximately 1 noiclog event to
      every 4 iclog writes. IOWs, every 4th call to xlog_get_iclog_space()
      is blocking waiting for log IO. With the changes in this patch, this
      drops to 1 noiclog event for every 100 iclog writes. Hence it is
      clear that log IO is completing much faster than it was previously,
      but it is also clear that for large iclog sizes, this isn't the
      performance limiting factor on this hardware.
      
      With smaller iclogs (32kB), however, there is a substantial
      difference. With the cache flush modifications, the journal is now
      running at over 4000 write IOPS, and the journal throughput is
      largely identical to the 256kB iclogs and the noiclog event rate
      stays low at about 1:50 iclog writes. The existing code tops out at
      about 2500 IOPS as the number of cache flushes dominate performance
      and latency. The noiclog event rate is about 1:4, and the
      performance variance is quite large as the journal throughput can
      fall to less than half the peak sustained rate when the cache flush
      rate prevents metadata writeback from keeping up and the log runs
      out of space and throttles reservations.
      
      As a result:
      
      	logbsize	fsmark create rate	rm -rf
      before	32kb		152851+/-5.3e+04	5m28s
      patched	32kb		221533+/-1.1e+04	5m24s
      
      before	256kb		220239+/-6.2e+03	4m58s
      patched	256kb		228286+/-9.2e+03	5m06s
      
      The rm -rf times are included because I ran them, but the
      differences are largely noise. This workload is largely metadata
      read IO latency bound, and the changes to the journal cache flushing
      don't really make any noticeable difference to behaviour apart from
      a reduction in noiclog events from background CIL pushing.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: remove need_start_rec parameter from xlog_write() · 3468bb1c
      Committed by Dave Chinner
      The CIL push is the only call to xlog_write that sets this variable
      to true. The other callers don't need a start rec, and they tell
      xlog_write what to do by passing the type of ophdr they need written
      in the flags field. The need_start_rec parameter essentially tells
      xlog_write to write an extra ophdr with an XLOG_START_TRANS type,
      so get rid of the variable to do this and pass XLOG_START_TRANS as
      the flag value into xlog_write() from the CIL push.
      
      $ size fs/xfs/xfs_log.o*
        text	   data	    bss	    dec	    hex	filename
       27595	    560	      8	  28163	   6e03	fs/xfs/xfs_log.o.orig
       27454	    560	      8	  28022	   6d76	fs/xfs/xfs_log.o.patched
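
      Illustratively, the CIL push call site changes along these lines
      (argument lists abbreviated, so treat this as a sketch):

      /* before: a trailing boolean requests the start record */
      error = xlog_write(log, &lvhdr, tic, &commit_lsn, NULL, 0, true);

      /* after: the ophdr type flag carries the same information */
      error = xlog_write(log, &lvhdr, tic, &commit_lsn, NULL,
                         XLOG_START_TRANS);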
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: CIL checkpoint flushes caches unconditionally · bad77c37
      Committed by Dave Chinner
      Currently every journal IO is issued as REQ_PREFLUSH | REQ_FUA to
      guarantee the ordering requirements the journal has w.r.t. metadata
      writeback. The two ordering constraints are:
      
      1. we cannot overwrite metadata in the journal until we guarantee
      that the dirty metadata has been written back in place and is
      stable.
      
      2. we cannot write back dirty metadata until it has been written to
      the journal and guaranteed to be stable (and hence recoverable) in
      the journal.
      
      These rules apply to the atomic transactions recorded in the
      journal, not to the journal IO itself. Hence we need to ensure
      metadata is stable before we start writing a new transaction to the
      journal (guarantee #1), and we need to ensure the entire transaction
      is stable in the journal before we start metadata writeback
      (guarantee #2).
      
      The ordering guarantees of #1 are currently provided by REQ_PREFLUSH
      being added to every iclog IO. This causes the journal IO to issue a
      cache flush and wait for it to complete before issuing the write IO
      to the journal. Hence all completed metadata IO is guaranteed to be
      stable before the journal overwrites the old metadata.
      
      However, for long running CIL checkpoints that might do a thousand
      journal IOs, we don't need every single one of these iclog IOs to
      issue a cache flush - the cache flush done before the first iclog is
      submitted is sufficient to cover the entire range in the log that
      the checkpoint will overwrite because the CIL space reservation
      guarantees the tail of the log (completed metadata) is already
      beyond the range of the checkpoint write.
      
      Hence we only need a full cache flush between closing off the CIL
      checkpoint context (i.e. when the push switches it out) and issuing
      the first journal IO. Rather than plumbing this through to the
      journal IO, we can start this cache flush the moment the CIL context
      is owned exclusively by the push worker. The cache flush can be in
      progress while we process the CIL ready for writing, hence
      reducing the latency of the initial iclog write. This is especially
      true for large checkpoints, where we might have to process hundreds
      of thousands of log vectors before we issue the first iclog write.
      In these cases, it is likely the cache flush has already been
      completed by the time we have built the CIL log vector chain.
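
      In sketch form (using the async flush helper this patch adds;
      details simplified):

      struct bio              bio;
      DECLARE_COMPLETION_ONSTACK(bdev_flush);

      /* kick the data device cache flush as soon as we own the ctx */
      xfs_flush_bdev_async(&bio, log->l_mp->m_ddev_targp->bt_bdev,
                           &bdev_flush);

      /* ...format the CIL into a log vector chain (can take a while)... */

      /* only now, just before the first iclog write, wait for it */
      wait_for_completion(&bdev_flush);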
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  6. 18 Jun 2021, 1 commit
    • xfs: separate CIL commit record IO · a79b28c2
      Committed by Dave Chinner
      To allow for iclog IO device cache flush behaviour to be optimised,
      we first need to separate out the commit record iclog IO from the
      rest of the checkpoint so we can wait for the checkpoint IO to
      complete before we issue the commit record.
      
      This separation is only necessary if the commit record is being
      written into a different iclog to the start of the checkpoint as the
      upcoming cache flushing changes require completion ordering against
      the other iclogs submitted by the checkpoint.
      
      If the entire checkpoint and commit is in the one iclog, then they
      are both covered by the one set of cache flush primitives on the
      iclog and hence there is no need to separate them for ordering.
      
      Otherwise, we need to wait for all the previous iclogs to complete
      so they are ordered correctly and made stable by the REQ_PREFLUSH
      that the commit record iclog IO issues. This guarantees that if a
      reader sees the commit record in the journal, they will also see the
      entire checkpoint that commit record closes off.
      
      This also provides the guarantee that when the commit record IO
      completes, we can safely unpin all the log items in the checkpoint
      so they can be written back because the entire checkpoint is stable
      in the journal.
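
      The core of the separation is a wait of roughly this shape
      (simplified sketch):

      /*
       * If the commit record landed in a different iclog to the start
       * of the checkpoint, wait for all prior iclog IO so the commit
       * record's REQ_PREFLUSH covers the whole checkpoint.
       */
      if (ctx->start_lsn != commit_lsn) {
              spin_lock(&log->l_icloglock);
              /* xlog_wait_on_iclog() drops l_icloglock when done */
              xlog_wait_on_iclog(commit_iclog->ic_prev);
      }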
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  7. 05 Aug 2020, 1 commit
  8. 29 Jul 2020, 1 commit
  9. 23 Jun 2020, 1 commit
    • xfs: fix use-after-free on CIL context on shutdown · c7f87f39
      Committed by Dave Chinner
      xlog_wait() on the CIL context can reference a freed context if the
      waiter doesn't get scheduled before the CIL context is freed. This
      can happen when a task is on the hard throttle and the CIL push
      aborts due to a shutdown. This was detected by generic/019:
      
      thread 1			thread 2
      
      __xfs_trans_commit
       xfs_log_commit_cil
        <CIL size over hard throttle limit>
        xlog_wait
         schedule
      				xlog_cil_push_work
      				wake_up_all
      				<shutdown aborts commit>
      				xlog_cil_committed
      				kmem_free
      
         remove_wait_queue
          spin_lock_irqsave --> UAF
      
      Fix it by moving the wait queue to the CIL rather than keeping it in
      the CIL context that gets freed on push completion. Because the
      wait queue is now independent of the CIL context and we might have
      multiple contexts in flight at once, only wake the waiters on the
      push throttle when the context we are pushing is over the hard
      throttle size threshold.
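
      The fix boils down to something like this sketch:

      struct xfs_cil {
              /* ...existing fields... */
              wait_queue_head_t       xc_push_wait;   /* moved from ctx */
      };

      /* in xlog_cil_push_work(), for the context being pushed: */
      if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
              wake_up_all(&cil->xc_push_wait);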
      
      Fixes: 0e7ab7ef ("xfs: Throttle commits on delayed background CIL push")
      Reported-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  10. 27 Mar 2020, 5 commits
  11. 23 Mar 2020, 3 commits
  12. 11 Nov 2019, 1 commit
  13. 22 Oct 2019, 1 commit
  14. 27 Aug 2019, 1 commit
  15. 29 Jun 2019, 6 commits
  16. 15 Apr 2019, 1 commit
    • xfs: wake commit waiters on CIL abort before log item abort · 545aa41f
      Committed by Brian Foster
      XFS shutdown deadlocks have been reproduced by fstest generic/475.
      The deadlock signature involves log I/O completion running error
      handling to abort logged items and waiting for an inode cluster
      buffer lock in the buffer item unpin handler. The buffer lock is
      held by xfsaild attempting to flush an inode. The buffer happens to
      be pinned and so xfs_iflush() triggers an async log force to begin
      work required to get it unpinned. The log force is blocked waiting
      on the commit completion, which never occurs and thus leaves the
      filesystem deadlocked.
      
      The root problem is that aborted log I/O completion puts commit
      completion behind callback completion, which is unexpected for async
      log forces. Under normal running conditions, an async log force
      returns to the caller once the CIL ctx has been formatted/submitted
      and the commit completion event triggered at the tail end of
      xlog_cil_push(). If the filesystem has shutdown, however, we rely on
      xlog_cil_committed() to trigger the completion event and it happens
      to do so after running log item unpin callbacks. This makes it
      unsafe to invoke an async log force from contexts that hold locks
      that might also be required in log completion processing.
      
      To address this problem, wake commit completion waiters before
      aborting log items in the log I/O completion handler. This ensures
      that an async log force will not deadlock on held locks if the
      filesystem happens to shutdown. Note that it is still unsafe to
      issue a sync log force while holding such locks because a sync log
      force explicitly waits on the force completion, which occurs after
      log I/O completion processing.
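
      In sketch form, the ordering change in the abort path looks like
      this (simplified):

      static void
      xlog_cil_committed(
              struct xfs_cil_ctx      *ctx,
              bool                    abort)
      {
              /*
               * Wake commit waiters before aborting the log items so
               * async log forces don't deadlock on locks that the
               * unpin callbacks may need.
               */
              if (abort) {
                      spin_lock(&ctx->cil->xc_push_lock);
                      wake_up_all(&ctx->cil->xc_commit_wait);
                      spin_unlock(&ctx->cil->xc_push_lock);
              }

              xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp,
                                       ctx->lv_chain, ctx->start_lsn, abort);
              /* ...remainder of completion processing... */
      }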
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>