1. 30 3月, 2022 1 次提交
    • D
      xfs: aborting inodes on shutdown may need buffer lock · d2d7c047
      Dave Chinner 提交于
      Most buffer io list operations are run with the bp->b_lock held, but
      xfs_iflush_abort() can be called without the buffer lock being held
      resulting in inodes being removed from the buffer list while other
      list operations are occurring. This causes problems with corrupted
      bp->b_io_list inode lists during filesystem shutdown, leading to
      traversals that never end, double removals from the AIL, etc.
      
      Fix this by passing the buffer to xfs_iflush_abort() if we have
      it locked. If the inode is attached to the buffer, we're going to
      have to remove it from the buffer list and we'd have to get the
      buffer off the inode log item to do that anyway.
      
      If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
      then we can determine if the inode has a log item and if it is
      attached to a buffer before we do anything else. If it does have an
      attached buffer, we can lock it safely (because the inode has a
      reference to it) and then perform the inode abort.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      d2d7c047
  2. 20 3月, 2022 1 次提交
    • D
      xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight · 01728b44
      Dave Chinner 提交于
      I've been chasing a recent resurgence in generic/388 recovery
      failure and/or corruption events. The events have largely been
      uninitialised inode chunks being tripped over in log recovery
      such as:
      
       XFS (pmem1): User initiated shutdown received.
       pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
       XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500).  Shutting down filesystem.
       XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
       XFS (pmem1): Unmounting Filesystem
       XFS (pmem1): Mounting V5 Filesystem
       XFS (pmem1): Starting recovery (logdev: internal)
       XFS (pmem1): bad inode magic/vsn daddr 8723584 #0 (magic=1818)
       XFS (pmem1): Metadata corruption detected at xfs_inode_buf_verify+0x180/0x190, xfs_inode block 0x851c80 xfs_inode_buf_verify
       XFS (pmem1): Unmount and run xfs_repair
       XFS (pmem1): First 128 bytes of corrupted metadata buffer:
       00000000: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000010: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000020: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000030: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000040: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000050: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000060: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       00000070: 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18  ................
       XFS (pmem1): metadata I/O error in "xlog_recover_items_pass2+0x52/0xc0" at daddr 0x851c80 len 32 error 117
       XFS (pmem1): log mount/recovery failed: error -117
       XFS (pmem1): log mount failed
      
      There have been isolated random other issues, too - xfs_repair fails
      because it finds some corruption in symlink blocks, rmap
      inconsistencies, etc - but they are nowhere near as common as the
      uninitialised inode chunk failure.
      
      The problem has clearly happened at runtime before recovery has run;
      I can see the ICREATE log item in the log shortly before the
      actively recovered range of the log. This means the ICREATE was
      definitely created and written to the log, but for some reason the
      tail of the log has been moved past the ordered buffer log item that
      tracks INODE_ALLOC buffers and, supposedly, prevents the tail of the
      log moving past the ICREATE log item before the inode chunk buffer
      is written to disk.
      
      Tracing the fsstress processes that are running when the filesystem
      shut down immediately pin-pointed the problem:
      
      user shutdown marks xfs_mount as shutdown
      
               godown-213341 [008]  6398.022871: console:              [ 6397.915392] XFS (pmem1): User initiated shutdown received.
      .....
      
      aild tries to push ordered inode cluster buffer
      
        xfsaild/pmem1-213314 [001]  6398.022974: xfs_buf_trylock:      dev 259:1 daddr 0x851c80 bbcount 0x20 hold 16 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_inode_item_push+0x8e
        xfsaild/pmem1-213314 [001]  6398.022976: xfs_ilock_nowait:     dev 259:1 ino 0x851c80 flags ILOCK_SHARED caller xfs_iflush_cluster+0xae
      
      xfs_iflush_cluster() checks xfs_is_shutdown(), returns true,
      calls xfs_iflush_abort() to kill writeback of the inode.
      Inode is removed from AIL, drops cluster buffer reference.
      
        xfsaild/pmem1-213314 [001]  6398.022977: xfs_ail_delete:       dev 259:1 lip 0xffff88880247ed80 old lsn 7/20344 new lsn 7/21000 type XFS_LI_INODE flags IN_AIL
        xfsaild/pmem1-213314 [001]  6398.022978: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 17 pincount 0 lock 0 flags DONE|INODES|PAGES caller xfs_iflush_abort+0xd7
      
      .....
      
      All inodes on cluster buffer are aborted, then the cluster buffer
      itself is aborted and removed from the AIL *without writeback*:
      
      xfsaild/pmem1-213314 [001]  6398.023011: xfs_buf_error_relse:  dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_ioend_fail+0x33
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_ail_delete:       dev 259:1 lip 0xffff8888053efde8 old lsn 7/20344 new lsn 7/20344 type XFS_LI_BUF flags IN_AIL
      
      The inode buffer was at 7/20344 when it was removed from the AIL.
      
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_item_relse:   dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_done+0x31
         xfsaild/pmem1-213314 [001]  6398.023012: xfs_buf_rele:         dev 259:1 daddr 0x851c80 bbcount 0x20 hold 2 pincount 0 lock 0 flags ASYNC|DONE|STALE|INODES|PAGES caller xfs_buf_item_relse+0x39
      
      .....
      
      Userspace is still running, doing stuff. an fsstress process runs
      syncfs() or sync() and we end up in sync_fs_one_sb() which issues
      a log force. This pushes on the CIL:
      
              fsstress-213322 [001]  6398.024430: xfs_fs_sync_fs:       dev 259:1 m_features 0x20000000019ff6e9 opstate (clean|shutdown|inodegc|blockgc) s_flags 0x70810000 caller sync_fs_one_sb+0x26
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x0 caller xfs_fs_sync_fs+0x82
              fsstress-213322 [001]  6398.024430: xfs_log_force:        dev 259:1 lsn 0x5f caller xfs_log_force+0x7c
                 <...>-194402 [001]  6398.024467: kmem_alloc:           size 176 flags 0x14 caller xlog_cil_push_work+0x9f
      
      And the CIL fills up iclogs with pending changes. This picks up
      the current tail from the AIL:
      
                 <...>-194402 [001]  6398.024497: xlog_iclog_get_space: dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x0 flags  caller xlog_write+0x149
                 <...>-194402 [001]  6398.024498: xlog_iclog_switch:    dev 259:1 state XLOG_STATE_ACTIVE refcnt 1 offset 0 lsn 0x700005408 flags  caller xlog_state_get_iclog_space+0x37e
                 <...>-194402 [001]  6398.024521: xlog_iclog_release:   dev 259:1 state XLOG_STATE_WANT_SYNC refcnt 1 offset 32256 lsn 0x700005408 flags  caller xlog_write+0x5f9
                 <...>-194402 [001]  6398.024522: xfs_log_assign_tail_lsn: dev 259:1 new tail lsn 7/21000, old lsn 7/20344, last sync 7/21448
      
      And it moves the tail of the log to 7/21000 from 7/20344. This
      *moves the tail of the log beyond the ICREATE transaction* that was
      at 7/20344 and pinned by the inode cluster buffer that was cancelled
      above.
      
      ....
      
               godown-213341 [008]  6398.027005: xfs_force_shutdown:   dev 259:1 tag logerror flags log_io|force_umount file fs/xfs/xfs_fsops.c line_num 500
                godown-213341 [008]  6398.027022: console:              [ 6397.915406] pmem1: writeback error on inode 12621949, offset 1019904, sector 12968096
                godown-213341 [008]  6398.030551: console:              [ 6397.919546] XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/
      
      And finally the log itself is now shutdown, stopping all further
      writes to the log. But this is too late to prevent the corruption
      that moving the tail of the log forwards after we start cancelling
      writeback causes.
      
      The fundamental problem here is that we are using the wrong shutdown
      checks for log items. We've long conflated mount shutdown with log
      shutdown state, and I started separating that recently with the
      atomic shutdown state changes in commit b36d4651 ("xfs: make
      forced shutdown processing atomic"). The changes in that commit
      series are directly responsible for being able to diagnose this
      issue because it clearly separated mount shutdown from log shutdown.
      
      Essentially, once we start cancelling writeback of log items and
      removing them from the AIL because the filesystem is shut down, we
      *cannot* update the journal because we may have cancelled the items
      that pin the tail of the log. That moves the tail of the log
      forwards without having written the metadata back, hence we have
      corrupt in memory state and writing to the journal propagates that
      to the on-disk state.
      
      What commit b36d4651 makes clear is that log item state needs to
      change relative to log shutdown, not mount shutdown. IOWs, anything
      that aborts metadata writeback needs to check log shutdown state
      because log items directly affect log consistency. Having them check
      mount shutdown state introduces the above race condition where we
      cancel metadata writeback before the log shuts down.
      
      To fix this, this patch works through all log items and converts
      shutdown checks to use xlog_is_shutdown() rather than
      xfs_is_shutdown(), so that we don't start aborting metadata
      writeback before we shut off journal writes.
      
      AFAICT, this race condition is a zero day IO error handling bug in
      XFS that dates back to the introduction of XLOG_IO_ERROR,
      XLOG_STATE_IOERROR and XFS_FORCED_SHUTDOWN back in January 1997.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      01728b44
  3. 15 3月, 2022 3 次提交
    • D
      xfs: constify the name argument to various directory functions · 996b2329
      Darrick J. Wong 提交于
      Various directory functions do not modify their @name parameter,
      so mark it const to make that clear.  This will enable us to mark
      the global xfs_name_dotdot variable as const to prevent mischief.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      996b2329
    • D
      xfs: reserve quota for target dir expansion when renaming files · 41667260
      Darrick J. Wong 提交于
      XFS does not reserve quota for directory expansion when renaming
      children into a directory.  This means that we don't reject the
      expansion with EDQUOT when we're at or near a hard limit, which means
      that unprivileged userspace can use rename() to exceed quota.
      
      Rename operations don't always expand the target directory, and we allow
      a rename to proceed with no space reservation if we don't need to add a
      block to the target directory to handle the addition.  Moreover, the
      unlink operation on the source directory generally does not expand the
      directory (you'd have to free a block and then cause a btree split) and
      it's probably of little consequence to leave the corner case that
      renaming a file out of a directory can increase its size.
      
      As with link and unlink, there is a further bug in that we do not
      trigger the blockgc workers to try to clear space when we're out of
      quota.
      
      Because rename is its own special tricky animal, we'll patch xfs_rename
      directly to reserve quota to the rename transaction.  We'll leave
      cleaning up the rest of xfs_rename for the metadata directory tree
      patchset.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      41667260
    • D
      xfs: reserve quota for dir expansion when linking/unlinking files · 871b9316
      Darrick J. Wong 提交于
      XFS does not reserve quota for directory expansion when linking or
      unlinking children from a directory.  This means that we don't reject
      the expansion with EDQUOT when we're at or near a hard limit, which
      means that unprivileged userspace can use link()/unlink() to exceed
      quota.
      
      The fix for this is nuanced -- link operations don't always expand the
      directory, and we allow a link to proceed with no space reservation if
      we don't need to add a block to the directory to handle the addition.
      Unlink operations generally do not expand the directory (you'd have to
      free a block and then cause a btree split) and we can defer the
      directory block freeing if there is no space reservation.
      
      Moreover, there is a further bug in that we do not trigger the blockgc
      workers to try to clear space when we're out of quota.
      
      To fix both cases, create a new xfs_trans_alloc_dir function that
      allocates the transaction, locks and joins the inodes, and reserves
      quota for the directory.  If there isn't sufficient space or quota,
      we'll switch the caller to reservationless mode.  This should prevent
      quota usage overruns with the least restriction in functionality.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      871b9316
  4. 05 12月, 2021 1 次提交
  5. 02 12月, 2021 1 次提交
  6. 31 10月, 2021 1 次提交
  7. 23 10月, 2021 2 次提交
  8. 20 8月, 2021 4 次提交
  9. 07 8月, 2021 1 次提交
  10. 16 7月, 2021 1 次提交
    • D
      xfs: reset child dir '..' entry when unlinking child · 5838d035
      Darrick J. Wong 提交于
      While running xfs/168, I noticed a second source of post-shrink
      corruption errors causing shutdowns.
      
      Let's say that directory B has a low inode number and is a child of
      directory A, which has a high number.  If B is empty but open, and
      unlinked from A, B's dotdot link continues to point to A.  If A is then
      unlinked and the filesystem shrunk so that A is no longer a valid inode,
      a subsequent AIL push of B will trip the inode verifiers because the
      dotdot entry points outside of the filesystem.
      
      To avoid this problem, reset B's dotdot entry to the root directory when
      unlinking directories, since the root directory cannot be removed.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NGao Xiang <hsiangkao@linux.alibaba.com>
      5838d035
  11. 13 7月, 2021 3 次提交
  12. 22 6月, 2021 1 次提交
    • D
      xfs: xfs_log_force_lsn isn't passed a LSN · 5f9b4b0d
      Dave Chinner 提交于
      In doing an investigation into AIL push stalls, I was looking at the
      log force code to see if an async CIL push could be done instead.
      This lead me to xfs_log_force_lsn() and looking at how it works.
      
      xfs_log_force_lsn() is only called from inode synchronisation
      contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn
      value as the LSN to sync the log to. This gets passed to
      xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the
      journal, and then used by xfs_log_force_lsn() to flush the iclogs to
      the journal.
      
      The problem is that ip->i_itemp->ili_last_lsn does not store a
      log sequence number. What it stores is passed to it from the
      ->iop_committing method, which is called by xfs_log_commit_cil().
      The value this passes to the iop_committing method is the CIL
      context sequence number that the item was committed to.
      
      As it turns out, xlog_cil_force_lsn() converts the sequence to an
      actual commit LSN for the related context and returns that to
      xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn"
      variable that contained a sequence with an actual LSN and then uses
      that to sync the iclogs.
      
      This caused me some confusion for a while, even though I originally
      wrote all this code a decade ago. ->iop_committing is only used by
      a couple of log item types, and only inode items use the sequence
      number it is passed.
      
      Let's clean up the API, CIL structures and inode log item to call it
      a sequence number, and make it clear that the high level code is
      using CIL sequence numbers and not on-disk LSNs for integrity
      synchronisation purposes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      5f9b4b0d
  13. 04 6月, 2021 1 次提交
  14. 02 6月, 2021 4 次提交
  15. 27 5月, 2021 1 次提交
    • G
      xfs: Fix fall-through warnings for Clang · 53004ee7
      Gustavo A. R. Silva 提交于
      In preparation to enable -Wimplicit-fallthrough for Clang, fix
      the following warnings by replacing /* fall through */ comments,
      and its variants, with the new pseudo-keyword macro fallthrough:
      
      fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      
      Notice that Clang doesn't recognize /* fall through */ comments as
      implicit fall-through markings, so in order to globally enable
      -Wimplicit-fallthrough for Clang, these comments need to be
      replaced with fallthrough; in the whole codebase.
      
      Link: https://github.com/KSPP/linux/issues/115Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>
      53004ee7
  16. 25 5月, 2021 1 次提交
    • D
      xfs: validate extsz hints against rt extent size when rtinherit is set · 603f000b
      Darrick J. Wong 提交于
      The RTINHERIT bit can be set on a directory so that newly created
      regular files will have the REALTIME bit set to store their data on the
      realtime volume.  If an extent size hint (and EXTSZINHERIT) are set on
      the directory, the hint will also be copied into the new file.
      
      As pointed out in previous patches, for realtime files we require the
      extent size hint be an integer multiple of the realtime extent, but we
      don't perform the same validation on a directory with both RTINHERIT and
      EXTSZINHERIT set, even though the only use-case of that combination is
      to propagate extent size hints into new realtime files.  This leads to
      inode corruption errors when the bad values are propagated.
      
      Because there may be existing filesystems with such a configuration, we
      cannot simply amend the inode verifier to trip on these directories and
      call it a day because that will cause previously "working" filesystems
      to start throwing errors abruptly.  Note that it's valid to have
      directories with rtinherit set even if there is no realtime volume, in
      which case the problem does not manifest because rtinherit is ignored if
      there's no realtime device; and it's possible that someone set the flag,
      crashed, repaired the filesystem (which clears the hint on the realtime
      file) and continued.
      
      Therefore, mitigate this issue in several ways: First, if we try to
      write out an inode with both rtinherit/extszinherit set and an unaligned
      extent size hint, turn off the hint to correct the error.  Second, if
      someone tries to misconfigure a directory via the fssetxattr ioctl, fail
      the ioctl.  Third, reverify both extent size hint values when we
      propagate heritable inode attributes from parent to child, to prevent
      misconfigurations from spreading.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      603f000b
  17. 16 4月, 2021 1 次提交
  18. 08 4月, 2021 12 次提交