1. 07 8月, 2021 2 次提交
  2. 30 7月, 2021 1 次提交
    • D
      xfs: logging the on disk inode LSN can make it go backwards · 32baa63d
      Dave Chinner 提交于
      When we log an inode, we format the "log inode" core and set an LSN
      in that inode core. We do that via xfs_inode_item_format_core(),
      which calls:
      
      	xfs_inode_to_log_dinode(ip, dic, ip->i_itemp->ili_item.li_lsn);
      
      to format the log inode. It writes the LSN from the inode item into
      the log inode, and if recovery decides the inode item needs to be
      replayed, it recovers the log inode LSN field and writes it into the
      on disk inode LSN field.
      
      Now this might seem like a reasonable thing to do, but it is wrong
      on multiple levels. Firstly, if the item is not yet in the AIL,
      item->li_lsn is zero. i.e. the first time the inode it is logged and
      formatted, the LSN we write into the log inode will be zero. If we
      only log it once, recovery will run and can write this zero LSN into
      the inode.
      
      This means that the next time the inode is logged and log recovery
      runs, it will *always* replay changes to the inode regardless of
      whether the inode is newer on disk than the version in the log and
      that violates the entire purpose of recording the LSN in the inode
      at writeback time (i.e. to stop it going backwards in time on disk
      during recovery).
      
      Secondly, if we commit the CIL to the journal so the inode item
      moves to the AIL, and then relog the inode, the LSN that gets
      stamped into the log inode will be the LSN of the inode's current
      location in the AIL, not it's age on disk. And it's not the LSN that
      will be associated with the current change. That means when log
      recovery replays this inode item, the LSN that ends up on disk is
      the LSN for the previous changes in the log, not the current
      changes being replayed. IOWs, after recovery the LSN on disk is not
      in sync with the LSN of the modifications that were replayed into
      the inode. This, again, violates the recovery ordering semantics
      that on-disk writeback LSNs provide.
      
      Hence the inode LSN in the log dinode is -always- invalid.
      
      Thirdly, recovery actually has the LSN of the log transaction it is
      replaying right at hand - it uses it to determine if it should
      replay the inode by comparing it to the on-disk inode's LSN. But it
      doesn't use that LSN to stamp the LSN into the inode which will be
      written back when the transaction is fully replayed. It uses the one
      in the log dinode, which we know is always going to be incorrect.
      
      Looking back at the change history, the inode logging was broken by
      commit 93f958f9 ("xfs: cull unnecessary icdinode fields") way
      back in 2016 by a stupid idiot who thought he knew how this code
      worked. i.e. me. That commit replaced an in memory di_lsn field that
      was updated only at inode writeback time from the inode item.li_lsn
      value - and hence always contained the same LSN that appeared in the
      on-disk inode - with a read of the inode item LSN at inode format
      time. CLearly these are not the same thing.
      
      Before 93f958f9, the log recovery behaviour was irrelevant,
      because the LSN in the log inode always matched the on-disk LSN at
      the time the inode was logged, hence recovery of the transaction
      would never make the on-disk LSN in the inode go backwards or get
      out of sync.
      
      A symptom of the problem is this, caught from a failure of
      generic/482. Before log recovery, the inode has been allocated but
      never used:
      
      xfs_db> inode 393388
      xfs_db> p
      core.magic = 0x494e
      core.mode = 0
      ....
      v3.crc = 0x99126961 (correct)
      v3.change_count = 0
      v3.lsn = 0
      v3.flags2 = 0
      v3.cowextsize = 0
      v3.crtime.sec = Thu Jan  1 10:00:00 1970
      v3.crtime.nsec = 0
      
      After log recovery:
      
      xfs_db> p
      core.magic = 0x494e
      core.mode = 020444
      ....
      v3.crc = 0x23e68f23 (correct)
      v3.change_count = 2
      v3.lsn = 0
      v3.flags2 = 0
      v3.cowextsize = 0
      v3.crtime.sec = Thu Jul 22 17:03:03 2021
      v3.crtime.nsec = 751000000
      ...
      
      You can see that the LSN of the on-disk inode is 0, even though it
      clearly has been written to disk. I point out this inode, because
      the generic/482 failure occurred because several adjacent inodes in
      this specific inode cluster were not replayed correctly and still
      appeared to be zero on disk when all the other metadata (inobt,
      finobt, directories, etc) indicated they should be allocated and
      written back.
      
      The fix for this is two-fold. The first is that we need to either
      revert the LSN changes in 93f958f9 or stop logging the inode LSN
      altogether. If we do the former, log recovery does not need to
      change but we add 8 bytes of memory per inode to store what is
      largely a write-only inode field. If we do the latter, log recovery
      needs to stamp the on-disk inode in the same manner that inode
      writeback does.
      
      I prefer the latter, because we shouldn't really be trying to log
      and replay changes to the on disk LSN as the on-disk value is the
      canonical source of the on-disk version of the inode. It also
      matches the way we recover buffer items - we create a buf_log_item
      that carries the current recovery transaction LSN that gets stamped
      into the buffer by the write verifier when it gets written back
      when the transaction is fully recovered.
      
      However, this might break log recovery on older kernels even more,
      so I'm going to simply ignore the logged value in recovery and stamp
      the on-disk inode with the LSN of the transaction being recovered
      that will trigger writeback on transaction recovery completion. This
      will ensure that the on-disk inode LSN always reflects the LSN of
      the last change that was written to disk, regardless of whether it
      comes from log recovery or runtime writeback.
      
      Fixes: 93f958f9 ("xfs: cull unnecessary icdinode fields")
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      32baa63d
  3. 16 7月, 2021 2 次提交
    • D
      xfs: correct the narrative around misaligned rtinherit/extszinherit dirs · 83193e5e
      Darrick J. Wong 提交于
      While auditing the realtime growfs code, I realized that the GROWFSRT
      ioctl (and by extension xfs_growfs) has always allowed sysadmins to
      change the realtime extent size when adding a realtime section to the
      filesystem.  Since we also have always allowed sysadmins to set
      RTINHERIT and EXTSZINHERIT on directories even if there is no realtime
      device, this invalidates the premise laid out in the comments added in
      commit 603f000b.
      
      In other words, this is not a case of inadequate metadata validation.
      This is a case of nearly forgotten (and apparently untested) but
      supported functionality.  Update the comments to reflect what we've
      learned, and remove the log message about correcting the misalignment.
      
      Fixes: 603f000b ("xfs: validate extsz hints against rt extent size when rtinherit is set")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      83193e5e
    • D
      xfs: check for sparse inode clusters that cross new EOAG when shrinking · da062d16
      Darrick J. Wong 提交于
      While running xfs/168, I noticed occasional write verifier shutdowns
      involving inodes at the very end of the filesystem.  Existing inode
      btree validation code checks that all inode clusters are fully contained
      within the filesystem.
      
      However, due to inadequate checking in the fs shrink code, it's possible
      that there could be a sparse inode cluster at the end of the filesystem
      where the upper inodes of the cluster are marked as holes and the
      corresponding blocks are free.  In this case, the last blocks in the AG
      are listed in the bnobt.  This enables the shrink to proceed but results
      in a filesystem that trips the inode verifiers.  Fix this by disallowing
      the shrink.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NGao Xiang <hsiangkao@linux.alibaba.com>
      da062d16
  4. 12 7月, 2021 1 次提交
    • G
      xfs: Fix multiple fall-through warnings for Clang · 5937e000
      Gustavo A. R. Silva 提交于
      In preparation to enable -Wimplicit-fallthrough for Clang, fix
      the following warnings by replacing /* fallthrough */ comments,
      and its variants, with the new pseudo-keyword macro fallthrough:
      
      fs/xfs/libxfs/xfs_attr.c:487:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:500:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:532:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:594:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:607:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:1410:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:1445:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_attr.c:1473:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      
      Notice that Clang doesn't recognize /* fallthrough */ comments as
      implicit fall-through markings, so in order to globally enable
      -Wimplicit-fallthrough for Clang, these comments need to be
      replaced with fallthrough; in the whole codebase.
      
      Link: https://github.com/KSPP/linux/issues/115Reviewed-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
      5937e000
  5. 26 6月, 2021 1 次提交
  6. 22 6月, 2021 2 次提交
    • D
      xfs: fix endianness issue in xfs_ag_shrink_space · a8f3522c
      Darrick J. Wong 提交于
      The AGI buffer is in big-endian format, so we must convert the
      endianness to CPU format to do any comparisons.
      
      Fixes: 46141dc8 ("xfs: introduce xfs_ag_shrink_space()")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NGao Xiang <hsiangkao@linux.alibaba.com>
      a8f3522c
    • D
      xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Dave Chinner 提交于
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This lead me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      5f9b4b0d
  7. 18 6月, 2021 2 次提交
  8. 10 6月, 2021 2 次提交
  9. 09 6月, 2021 1 次提交
    • D
      xfs: drop the AGI being passed to xfs_check_agi_freecount · 9ba0889e
      Dave Chinner 提交于
      From: Dave Chinner <dchinner@redhat.com>
      
      Stephen Rothwell reported this compiler warning from linux-next:
      
      fs/xfs/libxfs/xfs_ialloc.c: In function 'xfs_difree_finobt':
      fs/xfs/libxfs/xfs_ialloc.c:2032:20: warning: unused variable 'agi' [-Wunused-variable]
       2032 |  struct xfs_agi   *agi = agbp->b_addr;
      
      Which is fallout from agno -> perag conversions that were done in
      this function. xfs_check_agi_freecount() is the only user of "agi"
      in xfs_difree_finobt() now, and it only uses the agi to get the
      current free inode count. We hold that in the perag structure, so
      there's not need to directly reference the raw AGI to get this
      information.
      
      The btree cursor being passed to xfs_check_agi_freecount() has a
      reference to the perag being operated on, so use that directly in
      xfs_check_agi_freecount() rather than passing an AGI.
      
      Fixes: 7b13c515 ("xfs: use perag for ialloc btree cursors")
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      9ba0889e
  10. 04 6月, 2021 1 次提交
  11. 03 6月, 2021 3 次提交
  12. 02 6月, 2021 22 次提交