1. 10 8月, 2021 9 次提交
    • D
      xfs: throttle inode inactivation queuing on memory reclaim · 40b1de00
      Darrick J. Wong 提交于
      Now that we defer inode inactivation, we've decoupled the process of
      unlinking or closing an inode from the process of inactivating it.  In
      theory this should lead to better throughput since we now inactivate the
      queued inodes in batches instead of one at a time.
      
      Unfortunately, one of the primary risks with this decoupling is the loss
      of rate control feedback between the frontend and background threads.
      In other words, a rm -rf /* thread can run the system out of memory if
      it can queue inodes for inactivation and jump to a new CPU faster than
      the background threads can actually clear the deferred work.  The
      workers can get scheduled off the CPU if they have to do IO, etc.
      
      To solve this problem, we configure a shrinker so that it will activate
      the /second/ time the shrinkers are called.  The custom shrinker will
      queue all percpu deferred inactivation workers immediately and set a
      flag to force frontend callers who are releasing a vfs inode to wait for
      the inactivation workers.
      
      On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve
      most of the OOMing problem when deleting 10 million inodes.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      40b1de00
    • D
      xfs: avoid buffer deadlocks when walking fs inodes · a6343e4d
      Darrick J. Wong 提交于
      When we're servicing an INUMBERS or BULKSTAT request or running
      quotacheck, grab an empty transaction so that we can use its inherent
      recursive buffer locking abilities to detect inode btree cycles without
      hitting ABBA buffer deadlocks.  This patch requires the deferred inode
      inactivation patchset because xfs_irele cannot directly call
      xfs_inactive when the iwalk itself has an (empty) transaction.
      
      Found by fuzzing an inode btree pointer to introduce a cycle into the
      tree (xfs/365).
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      a6343e4d
    • D
      xfs: use background worker pool when transactions can't get free space · e8d04c2a
      Darrick J. Wong 提交于
      In xfs_trans_alloc, if the block reservation call returns ENOSPC, we
      call xfs_blockgc_free_space with a NULL icwalk structure to try to free
      space.  Each frontend thread that encounters this situation starts its
      own walk of the inode cache to see if it can find anything, which is
      wasteful since we don't have any additional selection criteria.  For
      this one common case, create a function that reschedules all pending
      background work immediately and flushes the workqueue so that the scan
      can run in parallel.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      e8d04c2a
    • D
      xfs: don't run speculative preallocation gc when fs is frozen · 6f649091
      Darrick J. Wong 提交于
      Now that we have the infrastructure to switch background workers on and
      off at will, fix the block gc worker code so that we don't actually run
      the worker when the filesystem is frozen, same as we do for deferred
      inactivation.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      6f649091
    • D
      xfs: flush inode inactivation work when compiling usage statistics · 01e8f379
      Darrick J. Wong 提交于
      Users have come to expect that the space accounting information in
      statfs and getquota reports are fairly accurate.  Now that we inactivate
      inodes from a background queue, these numbers can be thrown off by
      whatever resources are singly-owned by the inodes in the queue.  Flush
      the pending inactivations when userspace asks for a space usage report.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      01e8f379
    • D
      xfs: inactivate inodes any time we try to free speculative preallocations · 2eb66502
      Darrick J. Wong 提交于
      Other parts of XFS have learned to call xfs_blockgc_free_{space,quota}
      to try to free speculative preallocations when space is tight.  This
      means that file writes, transaction reservation failures, quota limit
      enforcement, and the EOFBLOCKS ioctl all call this function to free
      space when things are tight.
      
      Since inode inactivation is now a background task, this means that the
      filesystem can be hanging on to unlinked but not yet freed space.  Add
      this to the list of things that xfs_blockgc_free_* makes writer threads
      scan for when they cannot reserve space.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      2eb66502
    • D
      xfs: queue inactivation immediately when free realtime extents are tight · 65f03d86
      Darrick J. Wong 提交于
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Similar to the patch doing this for free space on the data device, if
      the file being inactivated is a realtime file and the realtime volume is
      running low on free extents, we want to run the worker ASAP so that the
      realtime allocator can make better decisions.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      65f03d86
    • D
      xfs: queue inactivation immediately when quota is nearing enforcement · 108523b8
      Darrick J. Wong 提交于
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      Specifically, if the dquots attached to the inode being inactivated are
      nearing any kind of enforcement boundary, we want to queue that
      inactivation work immediately so that users don't get EDQUOT/ENOSPC
      errors even after they deleted a bunch of files to stay within quota.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      108523b8
    • D
      xfs: queue inactivation immediately when free space is tight · 7d6f07d2
      Darrick J. Wong 提交于
      Now that we have made the inactivation of unlinked inodes a background
      task to increase the throughput of file deletions, we need to be a
      little more careful about how long of a delay we can tolerate.
      
      On a mostly empty filesystem, the risk of the allocator making poor
      decisions due to fragmentation of the free space on account a lengthy
      delay in background updates is minimal because there's plenty of space.
      However, if free space is tight, we want to deallocate unlinked inodes
      as quickly as possible to avoid fallocate ENOSPC and to give the
      allocator the best shot at optimal allocations for new writes.
      
      Therefore, queue the percpu worker immediately if the filesystem is more
      than 95% full.  This follows the same principle that XFS becomes less
      aggressive about speculative allocations and lazy cleanup (and more
      precise about accounting) when nearing full.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      7d6f07d2
  2. 07 8月, 2021 9 次提交
  3. 30 7月, 2021 12 次提交
    • D
      xfs: prevent spoofing of rtbitmap blocks when recovering buffers · 81a448d7
      Darrick J. Wong 提交于
      While reviewing the buffer item recovery code, the thought occurred to
      me: in V5 filesystems we use log sequence number (LSN) tracking to avoid
      replaying older metadata updates against newer log items.  However, we
      use the magic number of the ondisk buffer to find the LSN of the ondisk
      metadata, which means that if an attacker can control the layout of the
      realtime device precisely enough that the start of an rt bitmap block
      matches the magic and UUID of some other kind of block, they can control
      the purported LSN of that spoofed block and thereby break log replay.
      
      Since realtime bitmap and summary blocks don't have headers at all, we
      have no way to tell if a block really should be replayed.  The best we
      can do is replay unconditionally and hope for the best.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      81a448d7
    • D
      xfs: limit iclog tail updates · 9d110014
      Dave Chinner 提交于
      From the department of "generic/482 keeps on giving", we bring you
      another tail update race condition:
      
      iclog:
      	S1			C1
      	+-----------------------+-----------------------+
      				 S2			EOIC
      
      Two checkpoints in a single iclog. One is complete, the other just
      contains the start record and overruns into a new iclog.
      
      Timeline:
      
      Before S1:	Cache flush, log tail = X
      At S1:		Metadata stable, write start record and checkpoint
      At C1:		Write commit record, set NEED_FUA
      		Single iclog checkpoint, so no need for NEED_FLUSH
      		Log tail still = X, so no need for NEED_FLUSH
      
      After C1,
      Before S2:	Cache flush, log tail = X
      At S2:		Metadata stable, write start record and checkpoint
      After S2:	Log tail moves to X+1
      At EOIC:	End of iclog, more journal data to write
      		Releases iclog
      		Not a commit iclog, so no need for NEED_FLUSH
      		Writes log tail X+1 into iclog.
      
      At this point, the iclog has tail X+1 and NEED_FUA set. There has
      been no cache flush for the metadata between X and X+1, and the
      iclog writes the new tail permanently to the log. THis is sufficient
      to violate on disk metadata/journal ordering.
      
      We have two options here. The first is to detect this case in some
      manner and ensure that the partial checkpoint write sets NEED_FLUSH
      when the iclog is already marked NEED_FUA and the log tail changes.
      This seems somewhat fragile and quite complex to get right, and it
      doesn't actually make it obvious what underlying problem it is
      actually addressing from reading the code.
      
      The second option seems much cleaner to me, because it is derived
      directly from the requirements of the C1 commit record in the iclog.
      That is, when we write this commit record to the iclog, we've
      guaranteed that the metadata/data ordering is correct for tail
      update purposes. Hence if we only write the log tail into the iclog
      for the *first* commit record rather than the log tail at the last
      release, we guarantee that the log tail does not move past where the
      the first commit record in the log expects it to be.
      
      IOWs, taking the first option means that replay of C1 becomes
      dependent on future operations doing the right thing, not just the
      C1 checkpoint itself doing the right thing. This makes log recovery
      almost impossible to reason about because now we have to take into
      account what might or might not have happened in the future when
      looking at checkpoints in the log rather than just having to
      reconstruct the past...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      9d110014
    • D
      xfs: need to see iclog flags in tracing · b2ae3a9e
      Dave Chinner 提交于
      Because I cannot tell if the NEED_FLUSH flag is being set correctly
      by the log force and CIL push machinery without it.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      b2ae3a9e
    • D
      xfs: Enforce attr3 buffer recovery order · d8f4c2d0
      Dave Chinner 提交于
      From the department of "WTAF? How did we miss that!?"...
      
      When we are recovering a buffer, the first thing we do is check the
      buffer magic number and extract the LSN from the buffer. If the LSN
      is older than the current LSN, we replay the modification to it. If
      the metadata on disk is newer than the transaction in the log, we
      skip it. This is a fundamental v5 filesystem metadata recovery
      behaviour.
      
      generic/482 failed with an attribute writeback failure during log
      recovery. The write verifier caught the corruption before it got
      written to disk, and the attr buffer dump looked like:
      
      XFS (dm-3): Metadata corruption detected at xfs_attr3_leaf_verify+0x275/0x2e0, xfs_attr3_leaf block 0x19be8
      XFS (dm-3): Unmount and run xfs_repair
      XFS (dm-3): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 3b ee 00 00 4d 2a 01 e1  ........;...M*..
      00000010: 00 00 00 00 00 01 9b e8 00 00 00 01 00 00 05 38  ...............8
                                        ^^^^^^^^^^^^^^^^^^^^^^^
      00000020: df 39 5e 51 58 ac 44 b6 8d c5 e7 10 44 09 bc 17  .9^QX.D.....D...
      00000030: 00 00 00 00 00 02 00 83 00 03 00 cc 0f 24 01 00  .............$..
      00000040: 00 68 0e bc 0f c8 00 10 00 00 00 00 00 00 00 00  .h..............
      00000050: 00 00 3c 31 0f 24 01 00 00 00 3c 32 0f 88 01 00  ..<1.$....<2....
      00000060: 00 00 3c 33 0f d8 01 00 00 00 00 00 00 00 00 00  ..<3............
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      .....
      
      The highlighted bytes are the LSN that was replayed into the
      buffer: 0x100000538. This is cycle 1, block 0x538. Prior to replay,
      that block on disk looks like this:
      
      $ sudo xfs_db -c "fsb 0x417d" -c "type attr3" -c p /dev/mapper/thin-vol
      hdr.info.hdr.forw = 0
      hdr.info.hdr.back = 0
      hdr.info.hdr.magic = 0x3bee
      hdr.info.crc = 0xb5af0bc6 (correct)
      hdr.info.bno = 105448
      hdr.info.lsn = 0x100000900
                     ^^^^^^^^^^^
      hdr.info.uuid = df395e51-58ac-44b6-8dc5-e7104409bc17
      hdr.info.owner = 131203
      hdr.count = 2
      hdr.usedbytes = 120
      hdr.firstused = 3796
      hdr.holes = 1
      hdr.freemap[0-2] = [base,size]
      
      Note the LSN stamped into the buffer on disk: 1/0x900. The version
      on disk is much newer than the log transaction that was being
      replayed. That's a bug, and should -never- happen.
      
      So I immediately went to look at xlog_recover_get_buf_lsn() to check
      that we handled the LSN correctly. I was wondering if there was a
      similar "two commits with the same start LSN skips the second
      replay" problem with buffers. I didn't get that far, because I found
      a much more basic, rudimentary bug: xlog_recover_get_buf_lsn()
      doesn't recognise buffers with XFS_ATTR3_LEAF_MAGIC set in them!!!
      
      IOWs, attr3 leaf buffers fall through the magic number checks
      unrecognised, so trigger the "recover immediately" behaviour instead
      of undergoing an LSN check. IOWs, we incorrectly replay ATTR3 leaf
      buffers and that causes silent on disk corruption of inode attribute
      forks and potentially other things....
      
      Git history shows this is *another* zero day bug, this time
      introduced in commit 50d5c8d8 ("xfs: check LSN ordering for v5
      superblocks during recovery") which failed to handle the attr3 leaf
      buffers in recovery. And we've failed to handle them ever since...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      d8f4c2d0
    • D
      xfs: logging the on disk inode LSN can make it go backwards · 32baa63d
      Dave Chinner 提交于
      When we log an inode, we format the "log inode" core and set an LSN
      in that inode core. We do that via xfs_inode_item_format_core(),
      which calls:
      
      	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
      
      to format the log inode. It writes the LSN from the inode item into
      the log inode, and if recovery decides the inode item needs to be
      replayed, it recovers the log inode LSN field and writes it into the
      on disk inode LSN field.
      
      Now this might seem like a reasonable thing to do, but it is wrong
      on multiple levels. Firstly, if the item is not yet in the AIL,
      item->li_lsn is zero. i.e. the first time the inode it is logged and
      formatted, the LSN we write into the log inode will be zero. If we
      only log it once, recovery will run and can write this zero LSN into
      the inode.
      
      This means that the next time the inode is logged and log recovery
      runs, it will *always* replay changes to the inode regardless of
      whether the inode is newer on disk than the version in the log and
      that violates the entire purpose of recording the LSN in the inode
      at writeback time (i.e. to stop it going backwards in time on disk
      during recovery).
      
      Secondly, if we commit the CIL to the journal so the inode item
      moves to the AIL, and then relog the inode, the LSN that gets
      stamped into the log inode will be the LSN of the inode's current
      location in the AIL, not it's age on disk. And it's not the LSN that
      will be associated with the current change. That means when log
      recovery replays this inode item, the LSN that ends up on disk is
      the LSN for the previous changes in the log, not the current
      changes being replayed. IOWs, after recovery the LSN on disk is not
      in sync with the LSN of the modifications that were replayed into
      the inode. This, again, violates the recovery ordering semantics
      that on-disk writeback LSNs provide.
      
      Hence the inode LSN in the log dinode is -always- invalid.
      
      Thirdly, recovery actually has the LSN of the log transaction it is
      replaying right at hand - it uses it to determine if it should
      replay the inode by comparing it to the on-disk inode's LSN. But it
      doesn't use that LSN to stamp the LSN into the inode which will be
      written back when the transaction is fully replayed. It uses the one
      in the log dinode, which we know is always going to be incorrect.
      
      Looking back at the change history, the inode logging was broken by
      commit 93f958f9 ("xfs: cull unnecessary icdinode fields") way
      back in 2016 by a stupid idiot who thought he knew how this code
      worked. i.e. me. That commit replaced an in memory di_lsn field that
      was updated only at inode writeback time from the inode item.li_lsn
      value - and hence always contained the same LSN that appeared in the
      on-disk inode - with a read of the inode item LSN at inode format
      time. CLearly these are not the same thing.
      
      Before 93f958f9, the log recovery behaviour was irrelevant,
      because the LSN in the log inode always matched the on-disk LSN at
      the time the inode was logged, hence recovery of the transaction
      would never make the on-disk LSN in the inode go backwards or get
      out of sync.
      
      A symptom of the problem is this, caught from a failure of
      generic/482. Before log recovery, the inode has been allocated but
      never used:
      
      xfs_db> inode 393388
      xfs_db> p
      core.magic = 0x494e
      core.mode = 0
      ....
      v3.crc = 0x99126961 (correct)
      v3.change_count = 0
      v3.lsn = 0
      v3.flags2 = 0
      v3.cowextsize = 0
      v3.crtime.sec = Thu Jan  1 10:00:00 1970
      v3.crtime.nsec = 0
      
      After log recovery:
      
      xfs_db> p
      core.magic = 0x494e
      core.mode = 020444
      ....
      v3.crc = 0x23e68f23 (correct)
      v3.change_count = 2
      v3.lsn = 0
      v3.flags2 = 0
      v3.cowextsize = 0
      v3.crtime.sec = Thu Jul 22 17:03:03 2021
      v3.crtime.nsec = 751000000
      ...
      
      You can see that the LSN of the on-disk inode is 0, even though it
      clearly has been written to disk. I point out this inode, because
      the generic/482 failure occurred because several adjacent inodes in
      this specific inode cluster were not replayed correctly and still
      appeared to be zero on disk when all the other metadata (inobt,
      finobt, directories, etc) indicated they should be allocated and
      written back.
      
      The fix for this is two-fold. The first is that we need to either
      revert the LSN changes in 93f958f9 or stop logging the inode LSN
      altogether. If we do the former, log recovery does not need to
      change but we add 8 bytes of memory per inode to store what is
      largely a write-only inode field. If we do the latter, log recovery
      needs to stamp the on-disk inode in the same manner that inode
      writeback does.
      
      I prefer the latter, because we shouldn't really be trying to log
      and replay changes to the on disk LSN as the on-disk value is the
      canonical source of the on-disk version of the inode. It also
      matches the way we recover buffer items - we create a buf_log_item
      that carries the current recovery transaction LSN that gets stamped
      into the buffer by the write verifier when it gets written back
      when the transaction is fully recovered.
      
      However, this might break log recovery on older kernels even more,
      so I'm going to simply ignore the logged value in recovery and stamp
      the on-disk inode with the LSN of the transaction being recovered
      that will trigger writeback on transaction recovery completion. This
      will ensure that the on-disk inode LSN always reflects the LSN of
      the last change that was written to disk, regardless of whether it
      comes from log recovery or runtime writeback.
      
      Fixes: 93f958f9 ("xfs: cull unnecessary icdinode fields")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      32baa63d
    • D
      xfs: avoid unnecessary waits in xfs_log_force_lsn() · 8191d822
      Dave Chinner 提交于
      Before waiting on a iclog in xfs_log_force_lsn(), we don't check to
      see if the iclog has already been completed and the contents on
      stable storage. We check for completed iclogs in xfs_log_force(), so
      we should do the same thing for xfs_log_force_lsn().
      
      This fixed some random up-to-30s pauses seen in unmounting
      filesystems in some tests. A log force ends up waiting on completed
      iclog, and that doesn't then get flushed (and hence the log force
      get completed) until the background log worker issues a log force
      that flushes the iclog in question. Then the unmount unblocks and
      continues.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      8191d822
    • D
      xfs: log forces imply data device cache flushes · 2bf1ec0f
      Dave Chinner 提交于
      After fixing the tail_lsn vs cache flush race, generic/482 continued
      to fail in a similar way where cache flushes were missing before
      iclog FUA writes. Tracing of iclog state changes during the fsstress
      workload portion of the test (via xlog_iclog* events) indicated that
      iclog writes were coming from two sources - CIL pushes and log
      forces (due to fsync/O_SYNC operations). All of the cases where a
      recovery problem was triggered indicated that the log force was the
      source of the iclog write that was not preceeded by a cache flush.
      
      This was an oversight in the modifications made in commit
      eef983ff ("xfs: journal IO cache flush reductions"). Log forces
      for fsync imply a data device cache flush has been issued if an
      iclog was flushed to disk and is indicated to the caller via the
      log_flushed parameter so they can elide the device cache flush if
      the journal issued one.
      
      The change in eef983ff results in iclogs only issuing a cache
      flush if XLOG_ICL_NEED_FLUSH is set on the iclog, but this was not
      added to the iclogs that the log force code flushes to disk. Hence
      log forces are no longer guaranteeing that a cache flush is issued,
      hence opening up a potential on-disk ordering failure.
      
      Log forces should also set XLOG_ICL_NEED_FUA as well to ensure that
      the actual iclogs it forces to the journal are also on stable
      storage before it returns to the caller.
      
      This patch introduces the xlog_force_iclog() helper function to
      encapsulate the process of taking a reference to an iclog, switching
      its state if WANT_SYNC and flushing it to stable storage correctly.
      
      Both xfs_log_force() and xfs_log_force_lsn() are converted to use
      it, as is xlog_unmount_write() which has an elaborate method of
      doing exactly the same "write this iclog to stable storage"
      operation.
      
      Further, if the log force code needs to wait on a iclog in the
      WANT_SYNC state, it needs to ensure that iclog also results in a
      cache flush being issued. This covers the case where the iclog
      contains the commit record of the CIL flush that the log force
      triggered, but it hasn't been written yet because there is still an
      active reference to the iclog.
      
      Note: this whole cache flush whack-a-mole patch is a result of log
      forces still being iclog state centric rather than being CIL
      sequence centric. Most of this nasty code will go away in future
      when log forces are converted to wait on CIL sequence push
      completion rather than iclog completion. With the CIL push algorithm
      guaranteeing that the CIL checkpoint is fully on stable storage when
      it completes, we no longer need to iterate iclogs and push them to
      ensure a CIL sequence push has completed and so all this nasty iclog
      iteration and flushing code will go away.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      2bf1ec0f
    • D
      xfs: factor out forced iclog flushes · 45eddb41
      Dave Chinner 提交于
      We force iclogs in several places - we need them all to have the
      same cache flush semantics, so start by factoring out the iclog
      force into a common helper.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      45eddb41
    • D
      xfs: fix ordering violation between cache flushes and tail updates · 0dc8f7f1
      Dave Chinner 提交于
      There is a race between the new CIL async data device metadata IO
      completion cache flush and the log tail in the iclog the flush
      covers being updated. This can be seen by repeating generic/482 in a
      loop and eventually log recovery fails with a failures such as this:
      
      XFS (dm-3): Starting recovery (logdev: internal)
      XFS (dm-3): bad inode magic/vsn daddr 228352 #0 (magic=0)
      XFS (dm-3): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x37c00 xfs_inode_buf_verify
      XFS (dm-3): Unmount and run xfs_repair
      XFS (dm-3): First 128 bytes of corrupted metadata buffer:
      00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      XFS (dm-3): metadata I/O error in "xlog_recover_items_pass2+0x55/0xc0" at daddr 0x37c00 len 32 error 117
      
      Analysis of the logwrite replay shows that there were no writes to
      the data device between the FUA @ write 124 and the FUA at write @
      125, but log recovery @ 125 failed. The difference was the one log
      write @ 125 moved the tail of the log forwards from (1,8) to (1,32)
      and so the inode create intent in (1,8) was not replayed and so the
      inode cluster was zero on disk when replay of the first inode item
      in (1,32) was attempted.
      
      What this meant was that the journal write that occurred at @ 125
      did not ensure that metadata completed before the iclog was written
      was correctly on stable storage. The tail of the log moved forward,
      so IO must have been completed between the two iclog writes. This
      means that there is a race condition between the unconditional async
      cache flush in the CIL push work and the tail LSN that is written to
      the iclog. This happens like so:
      
      CIL push work				AIL push work
      -------------				-------------
      Add to committing list
      start async data dev cache flush
      .....
      <flush completes>
      <all writes to old tail lsn are stable>
      xlog_write
        ....					push inode create buffer
      					<start IO>
      					.....
      xlog_write(commit record)
        ....					<IO completes>
        					log tail moves
        					  xlog_assign_tail_lsn()
      start_lsn == commit_lsn
        <no iclog preflush!>
      xlog_state_release_iclog
        __xlog_state_release_iclog()
          <writes *new* tail_lsn into iclog>
        xlog_sync()
          ....
          submit_bio()
      <tail in log moves forward without flushing written metadata>
      
      Essentially, this can only occur if the commit iclog is issued
      without a cache flush. If the iclog bio is submitted with
      REQ_PREFLUSH, then it will guarantee that all the completed IO is
      one stable storage before the iclog bio with the new tail LSN in it
      is written to the log.
      
      IOWs, the tail lsn that is written to the iclog needs to be sampled
      *before* we issue the cache flush that guarantees all IO up to that
      LSN has been completed.
      
      To fix this without giving up the performance advantage of the
      flush/FUA optimisations (e.g. g/482 runtime halves with 5.14-rc1
      compared to 5.13), we need to ensure that we always issue a cache
      flush if the tail LSN changes between the initial async flush and
      the commit record being written. THis requires sampling the tail_lsn
      before we start the flush, and then passing the sampled tail LSN to
      xlog_state_release_iclog() so it can determine if the the tail LSN
      has changed while writing the checkpoint. If the tail LSN has
      changed, then it needs to set the NEED_FLUSH flag on the iclog and
      we'll issue another cache flush before writing the iclog.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      0dc8f7f1
    • D
      xfs: fold __xlog_state_release_iclog into xlog_state_release_iclog · 9d392064
      Dave Chinner 提交于
      Fold __xlog_state_release_iclog into its only caller to prepare
      make an upcoming fix easier.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      [hch: split from a larger patch]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      9d392064
    • D
      xfs: external logs need to flush data device · b5d721ea
      Dave Chinner 提交于
      The recent journal flush/FUA changes replaced the flushing of the
      data device on every iclog write with an up-front async data device
      cache flush. Unfortunately, the assumption of which this was based
      on has been proven incorrect by the flush vs log tail update
      ordering issue. As the fix for that issue uses the
      XLOG_ICL_NEED_FLUSH flag to indicate that data device needs a cache
      flush, we now need to (once again) ensure that an iclog write to
      external logs that need a cache flush to be issued actually issue a
      cache flush to the data device as well as the log device.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      b5d721ea
    • D
      xfs: flush data dev on external log write · b1e27239
      Dave Chinner 提交于
      We incorrectly flush the log device instead of the data device when
      trying to ensure metadata is correctly on disk before writing the
      unmount record.
      
      Fixes: eef983ff ("xfs: journal IO cache flush reductions")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      b1e27239
  4. 16 7月, 2021 7 次提交
    • D
      xfs: detect misaligned rtinherit directory extent size hints · b102a46c
      Darrick J. Wong 提交于
      If we encounter a directory that has been configured to pass on an
      extent size hint to a new realtime file and the hint isn't an integer
      multiple of the rt extent size, we should flag the hint for
      administrative review because that is a misconfiguration (that other
      parts of the kernel will fix automatically).
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      b102a46c
    • D
      xfs: fix an integer overflow error in xfs_growfs_rt · 0925fecc
      Darrick J. Wong 提交于
      During a realtime grow operation, we run a single transaction for each
      rt bitmap block added to the filesystem.  This means that each step has
      to be careful to increase sb_rblocks appropriately.
      
      Fix the integer overflow error in this calculation that can happen when
      the extent size is very large.  Found by running growfs to add a rt
      volume to a filesystem formatted with a 1g rt extent size.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      0925fecc
    • D
      xfs: improve FSGROWFSRT precondition checking · 0e2af929
      Darrick J. Wong 提交于
      Improve the checking at the start of a realtime grow operation so that
      we avoid accidentally set a new extent size that is too large and avoid
      adding an rt volume to a filesystem with rmap or reflink because we
      don't support rt rmap or reflink yet.
      
      While we're at it, separate the checks so that we're only testing one
      aspect at a time.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      0e2af929
    • D
      xfs: don't expose misaligned extszinherit hints to userspace · 5aa5b278
      Darrick J. Wong 提交于
      Commit 603f000b changed xfs_ioctl_setattr_check_extsize to reject an
      attempt to set an EXTSZINHERIT extent size hint on a directory with
      RTINHERIT set if the hint isn't a multiple of the realtime extent size.
      However, I have recently discovered that it is possible to change the
      realtime extent size when adding a rt device to a filesystem, which
      means that the existence of directories with misaligned inherited hints
      is not an accident.
      
      As a result, it's possible that someone could have set a valid hint and
      added an rt volume with a different rt extent size, which invalidates
      the ondisk hints.  After such a sequence, FSGETXATTR will report a
      misaligned hint, which FSSETXATTR will trip over, causing confusion if
      the user was doing the usual GET/SET sequence to change some other
      attribute.  Change xfs_fill_fsxattr to omit the hint if it isn't aligned
      properly.
      
      Fixes: 603f000b ("xfs: validate extsz hints against rt extent size when rtinherit is set")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      5aa5b278
    • D
      xfs: correct the narrative around misaligned rtinherit/extszinherit dirs · 83193e5e
      Darrick J. Wong 提交于
      While auditing the realtime growfs code, I realized that the GROWFSRT
      ioctl (and by extension xfs_growfs) has always allowed sysadmins to
      change the realtime extent size when adding a realtime section to the
      filesystem.  Since we also have always allowed sysadmins to set
      RTINHERIT and EXTSZINHERIT on directories even if there is no realtime
      device, this invalidates the premise laid out in the comments added in
      commit 603f000b.
      
      In other words, this is not a case of inadequate metadata validation.
      This is a case of nearly forgotten (and apparently untested) but
      supported functionality.  Update the comments to reflect what we've
      learned, and remove the log message about correcting the misalignment.
      
      Fixes: 603f000b ("xfs: validate extsz hints against rt extent size when rtinherit is set")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      83193e5e
    • D
      xfs: reset child dir '..' entry when unlinking child · 5838d035
      Darrick J. Wong 提交于
      While running xfs/168, I noticed a second source of post-shrink
      corruption errors causing shutdowns.
      
      Let's say that directory B has a low inode number and is a child of
      directory A, which has a high number.  If B is empty but open, and
      unlinked from A, B's dotdot link continues to point to A.  If A is then
      unlinked and the filesystem shrunk so that A is no longer a valid inode,
      a subsequent AIL push of B will trip the inode verifiers because the
      dotdot entry points outside of the filesystem.
      
      To avoid this problem, reset B's dotdot entry to the root directory when
      unlinking directories, since the root directory cannot be removed.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NGao Xiang <hsiangkao@linux.alibaba.com>
      5838d035
    • D
      xfs: check for sparse inode clusters that cross new EOAG when shrinking · da062d16
      Darrick J. Wong 提交于
      While running xfs/168, I noticed occasional write verifier shutdowns
      involving inodes at the very end of the filesystem.  Existing inode
      btree validation code checks that all inode clusters are fully contained
      within the filesystem.
      
      However, due to inadequate checking in the fs shrink code, it's possible
      that there could be a sparse inode cluster at the end of the filesystem
      where the upper inodes of the cluster are marked as holes and the
      corresponding blocks are free.  In this case, the last blocks in the AG
      are listed in the bnobt.  This enables the shrink to proceed but results
      in a filesystem that trips the inode verifiers.  Fix this by disallowing
      the shrink.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NGao Xiang <hsiangkao@linux.alibaba.com>
      da062d16
  5. 12 7月, 2021 1 次提交
    • G
      xfs: Fix multiple fall-through warnings for Clang · 5937e000
      Gustavo A. R. Silva 提交于
      In preparation to enable -Wimplicit-fallthrough for Clang, fix
      the following warnings by replacing /* fallthrough */ comments,
      and its variants, with the new pseudo-keyword macro fallthrough:
      
      fs/xfs/libxfs/xfs_attr.c:487:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:500:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:532:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:594:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:607:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:1410:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:1445:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:1473:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      
      Notice that Clang doesn't recognize /* fallthrough */ comments as
      implicit fall-through markings, so in order to globally enable
      -Wimplicit-fallthrough for Clang, these comments need to be
      replaced with fallthrough; in the whole codebase.
      
      Link: https://github.com/KSPP/linux/issues/115Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
      5937e000
  6. 30 6月, 2021 2 次提交