  1. 29 November 2022, 3 commits
    • xfs: drop write error injection is unfixable, remove it · 6e8af15c
      Authored by Dave Chinner
      With the changes to scan the page cache for dirty data to avoid data
      corruptions from partial write cleanup racing with other page cache
      operations, the drop writes error injection no longer works the same
      way it used to and causes xfs/196 to fail. This is because xfs/196
      writes to the file and populates the page cache before it turns on
      the error injection and starts failing -overwrites-.
      
      The result is that the original drop-writes code failed writes only
      -after- overwriting the data in the cache, followed by invalidating
      the cached data, then punching out the delalloc extent from under
      that data.
      
      On the surface, this looks fine. The problem is that page cache
      invalidation *doesn't guarantee that it removes anything from the
      page cache* and it doesn't change the dirty state of the folio. When
      block size == page size and we do page aligned IO (as xfs/196 does)
      everything happens to align perfectly and page cache invalidation
      removes the single page folios that span the written data. Hence the
      followup delalloc punch pass does not find cached data over that
      range and it can punch the extent out.
      
      IOWs, xfs/196 "works" for block size == page size with the new
      code. I say "works", because it actually only works for the case
      where IO is page aligned, and no data was read from disk before
      writes occur. Because the moment we actually read data first, the
      readahead code allocates multipage folios and suddenly the
      invalidate code goes back to zeroing subfolio ranges without
      changing dirty state.
      
      Hence, with multipage folios in play, block size == page size is
      functionally identical to block size < page size behaviour, and
      drop-writes is manifestly broken w.r.t. this case. Invalidation of
      a subfolio range doesn't result in the folio being removed from the
      cache, just the range gets zeroed. Hence after we've sequentially
      walked over a folio that we've dirtied (via write data) and then
      invalidated, we end up with a dirty folio full of zeroed data.
      
      And because the new code skips punching ranges that have dirty
      folios covering them, we end up leaving the delalloc range intact
      after failing all the writes. Hence failed writes now end up
      writing zeroes to disk in the cases where invalidation zeroes folios
      rather than removing them from cache.
      
      This is a fundamental change of behaviour that is needed to avoid
      the data corruption vectors that exist in the old write fail path,
      and it renders the drop-writes injection non-functional and
      unworkable as it stands.
      
      As it is, I think the error injection is also now unnecessary, as
      partial writes that need delalloc extent punching are going to be a lot
      more common with stale iomap detection in place. Hence this patch removes
      the drop-writes error injection completely. xfs/196 can remain for
      testing kernels that don't have this data corruption fix, but those
      that do will report:
      
      xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: use iomap_valid method to detect stale cached iomaps · 304a68b9
      Authored by Dave Chinner
      Now that iomap supports a mechanism to validate cached iomaps for
      buffered write operations, hook it up to the XFS buffered write ops
      so that we can avoid data corruptions that result from stale cached
      iomaps. See:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      or the ->iomap_valid() introduction commit for exact details of the
      corruption vector.
      
      The validity cookie we store in the iomap is based on the type of
      iomap we return. It is expected that the iomap->flags we set in
      xfs_bmbt_to_iomap() are not perturbed by the iomap core and are
      returned to us in the iomap passed via the .iomap_valid() callback.
      This ensures that the validity cookie is always checking the correct
      inode fork sequence numbers to detect potential changes that affect
      the extent cached by the iomap.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
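      A minimal sketch of the validity cookie idea described above, assuming
      the cookie packs the fork sequence counters that back the mapping and
      that ->iomap_valid() simply recomputes it under the unchanged
      iomap->flags; the helper and field names used here
      (xfs_iomap_inode_sequence, validity_cookie) are illustrative rather
      than a verbatim copy of the XFS code:

      static u64
      xfs_iomap_inode_sequence(struct xfs_inode *ip, u16 iomap_flags)
      {
              u64 cookie = 0;

              if (iomap_flags & IOMAP_F_SHARED)   /* mapping backed by the CoW fork */
                      cookie = (u64)READ_ONCE(ip->i_cowfp->if_seq) << 32;
              return cookie | READ_ONCE(ip->i_df.if_seq);
      }

      static bool
      xfs_buffered_write_iomap_valid(struct inode *inode, const struct iomap *iomap)
      {
              /* The iomap is stale if either fork changed since it was built. */
              return iomap->validity_cookie ==
                              xfs_iomap_inode_sequence(XFS_I(inode), iomap->flags);
      }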
    • xfs: xfs_bmap_punch_delalloc_range() should take a byte range · 7348b322
      Authored by Dave Chinner
      All the callers of xfs_bmap_punch_delalloc_range() jump through
      hoops to convert a byte range to filesystem blocks before calling
      xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
      xfs_bmap_punch_delalloc_range() and have it do the conversion to
      filesystem blocks internally.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
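      A sketch of the resulting interface, assuming the internal conversion
      uses the existing XFS_B_TO_FSBT/XFS_B_TO_FSB byte-to-block macros; the
      body is elided and illustrative only:

      void
      xfs_bmap_punch_delalloc_range(
              struct xfs_inode        *ip,
              xfs_off_t               start_byte,
              xfs_off_t               end_byte)
      {
              struct xfs_mount        *mp = ip->i_mount;
              xfs_fileoff_t           start_fsb = XFS_B_TO_FSBT(mp, start_byte);
              xfs_fileoff_t           end_fsb = XFS_B_TO_FSB(mp, end_byte);

              /* ... find and punch the delalloc extents in [start_fsb, end_fsb) ... */
      }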
  2. 23 November 2022, 3 commits
    • xfs,iomap: move delalloc punching to iomap · 9c7babf9
      Authored by Dave Chinner
      Because that's what Christoph wants for this error handling path
      only XFS uses.
      
      It requires a new iomap export for handling errors over delalloc
      ranges. This is basically the XFS code as it stands, but even though
      Christoph wants this as iomap functionality, we still have to call it
      from the filesystem-specific ->iomap_end callback, and call into the
      iomap code with yet another filesystem-specific callback to punch the
      delalloc extent within the defined ranges.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
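      The rough shape of that arrangement is sketched below; the export and
      callback names (iomap_file_buffered_write_punch_delalloc,
      xfs_buffered_write_delalloc_punch) are paraphrased from this series
      and the exact signatures are illustrative, not authoritative:

      /* iomap export: clean up after a failed or short buffered write. */
      int iomap_file_buffered_write_punch_delalloc(struct inode *inode,
                      struct iomap *iomap, loff_t pos, loff_t length,
                      ssize_t written,
                      int (*punch)(struct inode *inode, loff_t pos, loff_t length));

      /* XFS ->iomap_end: hand iomap a callback that punches the delalloc extent. */
      static int
      xfs_buffered_write_iomap_end(struct inode *inode, loff_t offset,
                      loff_t length, ssize_t written, unsigned flags,
                      struct iomap *iomap)
      {
              /* Only newly allocated delalloc mappings need cleanup on failure. */
              if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
                      return 0;

              return iomap_file_buffered_write_punch_delalloc(inode, iomap, offset,
                              length, written, &xfs_buffered_write_delalloc_punch);
      }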
    • xfs: use byte ranges for write cleanup ranges · b71f889c
      Authored by Dave Chinner
      xfs_buffered_write_iomap_end() currently converts the byte ranges
      passed to it to filesystem blocks to pass them to the bmap code to
      punch out delalloc blocks, but then has to convert filesystem
      blocks back to byte ranges for page cache truncate.
      
      We're about to make the page cache truncate go away and replace it
      with a page cache walk, so having to convert everything to/from/to
      filesystem blocks is messy and error-prone. It is much easier to
      pass around byte ranges and convert to page indexes and/or
      filesystem blocks only where those units are needed.
      
      In preparation for the page cache walk being added, add a helper
      that converts byte ranges to filesystem blocks and calls
      xfs_bmap_punch_delalloc_range() and convert
      xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: punching delalloc extents on write failure is racy · 198dd8ae
      Authored by Dave Chinner
      xfs_buffered_write_iomap_end() has a comment about the safety of
      punching delalloc extents based on holding the IOLOCK_EXCL. This
      comment is wrong, and punching delalloc extents is not race free.
      
      When we punch out a delalloc extent after a write failure in
      xfs_buffered_write_iomap_end(), we punch out the page cache with
      truncate_pagecache_range() before we punch out the delalloc extents.
      At this point, we only hold the IOLOCK_EXCL, so there is nothing
      stopping mmap() write faults racing with this cleanup operation,
      reinstantiating a folio over the range we are about to punch and
      hence requiring the delalloc extent to be kept.
      
      If this race condition is hit, we can end up with a dirty page in
      the page cache that has no delalloc extent or space reservation
      backing it. This leads to bad things happening at writeback time.
      
      To avoid this race condition, we need the page cache truncation to
      be atomic w.r.t. the extent manipulation. We can do this by holding
      the mapping->invalidate_lock exclusively across this operation -
      this will prevent new pages from being inserted into the page cache
      whilst we are removing the pages and the backing extent and space
      reservation.
      
      Taking the mapping->invalidate_lock exclusively in the buffered
      write IO path is safe - it naturally nests inside the IOLOCK (see
      truncate and fallocate paths). iomap_zero_range() can be called from
      under the mapping->invalidate_lock (from the truncate path via
      either xfs_zero_eof() or xfs_truncate_page()), but iomap_zero_iter()
      will not instantiate new delalloc pages (because it skips holes) and
      hence will not ever need to punch out delalloc extents on failure.
      
      Fix the locking issue, and clean up the code logic a little to avoid
      unnecessary work if we didn't allocate the delalloc extent or wrote
      the entire region we allocated.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
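      A minimal sketch of the locking described above, as it might appear in
      the write-failure cleanup path; the surrounding variables are
      illustrative and the fragment is not the verbatim XFS code:

      struct address_space *mapping = VFS_I(ip)->i_mapping;

      /* Block page faults from re-instantiating folios while we clean up. */
      filemap_invalidate_lock(mapping);
      truncate_pagecache_range(VFS_I(ip), start_byte, end_byte - 1);
      error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
                                            end_fsb - start_fsb);
      filemap_invalidate_unlock(mapping);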
  3. 22 November 2022, 1 commit
    • xfs: fix incorrect i_nlink caused by inode racing · 28b4b059
      Authored by Long Li
      The following error occurred during the fsstress test:
      
      XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
      
      The problem was that an inode race condition caused an incorrect i_nlink
      to be written to disk, which was then read back into memory. Consider the
      following call graph: for inodes that are marked as both XFS_IFLUSHING and
      XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its original
      value in xfs_reinit_inode(). Therefore, the on-disk i_nlink of a directory
      may be set to 1.
      
        xfsaild
            xfs_inode_item_push
                xfs_iflush_cluster
                    xfs_iflush
                        xfs_inode_to_disk
      
        xfs_iget
            xfs_iget_cache_hit
                xfs_iget_recycle
                    xfs_reinit_inode
                        inode_init_always
      
      xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
      inode state and can race with other RCU protected inode lookups. On the
      read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
      ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
      racing inode updates (during transactions) by that lock.
      
      Fixes: ff7bebeb ("xfs: refactor the inode recycling code") # goes further back than this
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
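      A sketch of the fix, following the call graph above; the function is
      reduced to the locking change and is illustrative only:

      static int
      xfs_iget_recycle(struct xfs_perag *pag, struct xfs_inode *ip)
      {
              int error;

              /* ... existing recycle setup (drop i_flags_lock, RCU, etc.) ... */

              /*
               * Re-initialising the VFS inode temporarily changes i_nlink; hold
               * ILOCK_EXCL so this cannot race with xfs_iflush_cluster()
               * sampling the inode under ILOCK_SHARED.
               */
              xfs_ilock(ip, XFS_ILOCK_EXCL);
              error = xfs_reinit_inode(ip->i_mount, VFS_I(ip));
              xfs_iunlock(ip, XFS_ILOCK_EXCL);

              return error;
      }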
  4. 17 November 2022, 26 commits
    • xfs: Print XFS UUID on mount and umount events. · 64c80dfd
      Authored by Lukas Herbolt
      As of now only device names are printed out over __xfs_printk().
      Device names are not persistent across reboots, which, when searching
      for the origin of a corruption, adds the extra task of properly
      identifying the devices. This patch prints the XFS UUID on every
      mount/umount event, which makes that identification much easier.
      Signed-off-by: Lukas Herbolt <lukas@herbolt.com>
      [sandeen: rebase onto current upstream kernel]
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
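      A sketch of what the mount/unmount messages could look like, assuming
      the standard %pU printk format specifier for UUIDs; the exact wording
      is illustrative:

      xfs_notice(mp, "Mounting V%d Filesystem %pU",
                      XFS_SB_VERSION_NUM(&mp->m_sb), &mp->m_sb.sb_uuid);
      /* ... */
      xfs_notice(mp, "Unmounting Filesystem %pU", &mp->m_sb.sb_uuid);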
    • xfs: fix sb write verify for lazysbcount · 59f6ab40
      Authored by Long Li
      When lazysbcount is enabled, fsstress combined with a loop mount/unmount
      test reports the following problems:
      
      XFS (loop0): SB summary counter sanity check failed
      XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
      	xfs_sb block 0x0
      XFS (loop0): Unmount and run xfs_repair
      XFS (loop0): First 128 bytes of corrupted metadata buffer:
      00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
      00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
      00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
      00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
      00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
      XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
      	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
      XFS (loop0): Please unmount the filesystem and rectify the problem(s)
      XFS (loop0): log mount/recovery failed: error -117
      XFS (loop0): log mount failed
      
      This corruption will shut down the file system and the file system will
      no longer be mountable. The following script can reproduce the problem,
      but it may take a long time.
      
       #!/bin/bash
      
       device=/dev/sda
       testdir=/mnt/test
       round=0
      
       function fail()
       {
      	 echo "$*"
      	 exit 1
       }
      
       mkdir -p $testdir
       while [ $round -lt 10000 ]
       do
      	 echo "******* round $round ********"
      	 mkfs.xfs -f $device
      	 mount $device $testdir || fail "mount failed!"
      	 fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
      	 sleep 4
      	 killall -w fsstress
      	 umount $testdir
      	 xfs_repair -e $device > /dev/null
      	 if [ $? -eq 2 ];then
      		 echo "ERR CODE 2: Dirty log exception during repair."
      		 exit 1
      	 fi
      	 round=$(($round+1))
       done
      
      With lazysbcount enabled, there is no additional lock protection for
      reading m_ifree and m_icount in xfs_log_sb(). If another CPU modifies
      m_ifree concurrently, the sampled m_ifree can end up greater than
      m_icount. For example, consider the following sequence with a positive
      ifreedelta:
      
       CPU0				 CPU1
       xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
       ----------			 ------------------------------
       percpu_counter_sum(&mp->m_icount)
      				 percpu_counter_add_batch(&mp->m_icount,
      						idelta, XFS_ICOUNT_BATCH)
      				 percpu_counter_add(&mp->m_ifree, ifreedelta);
       percpu_counter_sum(&mp->m_ifree)
      
      After this, an incorrect inode count (sb_ifree > sb_icount) will be
      written to the log. When the superblock is subsequently written, that
      incorrect inode count will fail the boundary check in
      xfs_validate_sb_write(), which causes the file system to shut down.
      
      When lazysbcount is enabled, we don't need to guarantee that Lazy sb
      counters are completely correct, but we do need to guarantee that sb_ifree
      <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
      must be satisfied any time that there /cannot/ be other threads allocating
      or freeing inode chunks. If the constraint is violated under these
      circumstances, sb_i{count,free} (the ondisk superblock inode counters)
      may be incorrect and need to be marked sick at unmount; the counters will
      be rebuilt on the next mount.
      
      Fixes: 8756a5af ("libxfs: add more bounds checking to sb sanity checks")
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
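      A minimal sketch of the fix in xfs_log_sb(), assuming a simple clamp of
      the sampled counters; illustrative only:

      if (xfs_has_lazysbcount(mp)) {
              mp->m_sb.sb_icount = percpu_counter_sum_positive(&mp->m_icount);
              /* Never let a racing sample push sb_ifree above sb_icount. */
              mp->m_sb.sb_ifree = min_t(uint64_t,
                              percpu_counter_sum_positive(&mp->m_ifree),
                              mp->m_sb.sb_icount);
      }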
    • xfs: fix incorrect error-out in xfs_remove · 2653d533
      Authored by Darrick J. Wong
      Clean up resources if resetting the dotdot entry doesn't succeed.
      Observed through code inspection.
      
      Fixes: 5838d035 ("xfs: reset child dir '..' entry when unlinking child")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Andrey Albershteyn <aalbersh@redhat.com>
    • xfs: check inode core when scrubbing metadata files · f36b954a
      Authored by Darrick J. Wong
      Metadata files (e.g. realtime bitmaps and quota files) do not show up in
      the bulkstat output, which means that scrub-by-handle does not work;
      they can only be checked through a specific scrub type.  Therefore, each
      scrub type calls xchk_metadata_inode_forks to check the metadata for
      whatever's in the file.
      
      Unfortunately, that function doesn't actually check the inode record
      itself.  Refactor the function a bit to make that happen.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't warn about files that are exactly s_maxbytes long · bd5ab5f9
      Authored by Darrick J. Wong
      We can handle files that are exactly s_maxbytes bytes long; we just
      can't handle anything larger than that.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: teach scrub to flag non-extents format cow forks · 5eef4635
      Authored by Darrick J. Wong
      CoW forks only exist in memory, which means that they can only ever have
      an incore extent tree.  Hence they must always be FMT_EXTENTS, so check
      this when we're scrubbing them.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: check that CoW fork extents are not shared · 31785537
      Authored by Darrick J. Wong
      Ensure that extents in an inode's CoW fork are not marked as shared in
      the refcount btree.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: check quota files for unwritten extents · f23c4044
      Authored by Darrick J. Wong
      Teach scrub to flag quota files containing unwritten extents.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: block map scrub should handle incore delalloc reservations · 830ffa09
      Authored by Darrick J. Wong
      Enhance the block map scrubber to check delayed allocation reservations.
      Though there are no physical space allocations to check, we do need to
      make sure that the range of file offsets being mapped are correct, and
      to bump the lastoff cursor so that key order checking works correctly.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: teach scrub to check for adjacent bmaps when rmap larger than bmap · 6a577786
      Authored by Darrick J. Wong
      When scrub is checking file fork mappings against rmap records and
      the rmap record starts before or ends after the bmap record, check the
      adjacent bmap records to make sure that they're adjacent to the one
      we're checking.  This helps us to detect cases where the rmaps cover
      territory that the bmaps do not.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix perag loop in xchk_bmap_check_rmaps · 033985b6
      Authored by Darrick J. Wong
      sparse complains that we can return an uninitialized error from this
      function and that pag could be uninitialized.  We know that there are no
      zero-AG filesystems and hence we had to call xchk_bmap_check_ag_rmaps at
      least once, so this is not actually possible, but I'm too worn out from
      automated complaints from unsophisticated AIs so let's just fix this and
      move on to more interesting problems, eh?
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: online checking of the free rt extent count · e74331d6
      Authored by Darrick J. Wong
      Teach the summary count checker to count the number of free realtime
      extents and compare that to the superblock copy.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: make rtbitmap ILOCKing consistent when scanning the rt bitmap file · 5f369dc5
      Authored by Darrick J. Wong
      xfs_rtalloc_query_range scans the realtime bitmap file in order of
      increasing file offset, so this caller can take ILOCK_SHARED on the rt
      bitmap inode instead of ILOCK_EXCL.  This isn't going to yield any
      practical benefits at mount time, but we'd like to make the locking
      usage consistent around xfs_rtalloc_query_all calls.  Make all the
      places we do this use the same xfs_ilock lockflags for consistency.
      
      Fixes: 4c934c7d ("xfs: report realtime space information via the rtbitmap")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't return -EFSCORRUPTED from repair when resources cannot be grabbed · 93b0c58e
      Authored by Darrick J. Wong
      If we tried to repair something but the repair failed with -EDEADLOCK,
      that means that the repair function couldn't grab some resource it
      needed and wants us to try again.  If we try again (with TRY_HARDER) but
      still can't get all the resources we need, the repair fails and errors
      remain on the filesystem.
      
      Right now, repair returns the -EDEADLOCK to the caller as -EFSCORRUPTED,
      which results in XFS_SCRUB_OFLAG_CORRUPT being passed out to userspace.
      This is not correct because repair has not determined that anything is
      corrupt.  If the repair had been invoked on an object that could be
      optimized but wasn't corrupt (OFLAG_PREEN), the inability to grab
      resources will be reported to userspace as corrupt metadata, and users
      will be unnecessarily alarmed that their suboptimal metadata turned into
      a corruption.
      
      Fix this by returning zero so that the results of the actual scrub will
      be copied back out to userspace.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
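      A sketch of the error handling described above, assuming the scrub
      retry flag is named XCHK_TRY_HARDER; illustrative only:

      if (error == -EDEADLOCK) {
              /* Second failure: report the scrub observations, not corruption. */
              if (sc->flags & XCHK_TRY_HARDER)
                      return 0;
              /* First failure: ask the caller to retry with more resources held. */
              sc->flags |= XCHK_TRY_HARDER;
              return -EDEADLOCK;
      }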
    • xfs: skip fscounters comparisons when the scan is incomplete · 11f97e68
      Authored by Darrick J. Wong
      If any part of the per-AG summary counter scan loop aborts without
      collecting all of the data we need, the scrubber's observation data will
      be invalid.  Set the incomplete flag so that we abort the scrub without
      reporting false corruptions.  Document the data dependency here too.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: load rtbitmap and rtsummary extent mapping btrees at mount time · 9e13975b
      Authored by Darrick J. Wong
      It turns out that GETFSMAP and online fsck have had a bug for years due
      to their use of ILOCK_SHARED to coordinate their linear scans of the
      realtime bitmap.  If the bitmap file's data fork happens to be in BTREE
      format and the scan occurs immediately after mounting, the incore bmbt
      will not be populated, leading to ASSERTs tripping over the incorrect
      inode state.  Because the bitmap scans always lock bitmap buffers in
      increasing order of file offset, it is appropriate for these two callers
      to take a shared ILOCK to improve scalability.
      
      To fix this problem, load both data and attr fork state into memory when
      mounting the realtime inodes.  Realtime metadata files aren't supposed
      to have an attr fork so the second step is likely a nop.
      
      On most filesystems this is unlikely since the rtbitmap data fork is
      usually in extents format, but it's possible to craft a filesystem that
      will hit this by fragmenting the free space in the data section and
      growfsing the
      rt section.
      
      Fixes: 4c934c7d ("xfs: report realtime space information via the rtbitmap")
      Also-Fixes: 46d9bfb5 ("xfs: cross-reference the realtime bitmap")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
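      A sketch of the mount-time fix, assuming a helper along these lines
      (the name xfs_rtmount_iread_extents is an assumption, and the details
      are illustrative):

      static int
      xfs_rtmount_iread_extents(struct xfs_inode *ip)
      {
              int error;

              xfs_ilock(ip, XFS_ILOCK_EXCL);

              /* Populate the incore data fork bmbt before ILOCK_SHARED scans run. */
              error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);

              /* rt metadata files shouldn't have an attr fork, so this is likely a nop. */
              if (!error && xfs_inode_has_attr_fork(ip))
                      error = xfs_iread_extents(NULL, ip, XFS_ATTR_FORK);

              xfs_iunlock(ip, XFS_ILOCK_EXCL);
              return error;
      }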
    • xfs: pivot online scrub away from kmem.[ch] · 306195f3
      Authored by Darrick J. Wong
      Convert all the online scrub code to use the Linux slab allocator
      functions directly instead of going through the kmem wrappers.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: don't retry repairs harder when EAGAIN is returned · 6bf2f879
      Authored by Darrick J. Wong
      Repair functions will not return EAGAIN -- if they were not able to
      obtain resources, they should return EDEADLOCK (like the rest of online
      fsck) to signal that we need to grab all the resources and try again.
      Hence we don't need to deal with this case except as a debugging
      assertion.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: initialize the check_owner object fully · fcd2a434
      Authored by Darrick J. Wong
      Initialize the check_owner list head so that we don't corrupt the list.
      Reduce the scope of the object pointer.
      
      Fixes: 858333dc ("xfs: check btree block ownership with bnobt/rmapbt when scrubbing btree")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fix return code when fatal signal encountered during dquot scrub · 0a713bd4
      Authored by Darrick J. Wong
      If the scrub process is sent a fatal signal while we're checking dquots,
      the predicate for this will set the error code to -EINTR.  Don't then
      squash that into -ECANCELED, because the wrong errno turns up in the
      trace output.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: return EINTR when a fatal signal terminates scrub · a7a0f9a5
      Authored by Darrick J. Wong
      If the program calling online fsck is terminated with a fatal signal,
      bail out to userspace by returning EINTR, not EAGAIN.  EAGAIN is used by
      scrubbers to indicate that we should try again with more resources
      locked, and not to indicate that the operation was cancelled.  The
      miswiring is mostly harmless, but it shows up in the trace data.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: make AGFL repair function avoid crosslinked blocks · b255fab0
      Authored by Darrick J. Wong
      Teach the AGFL repair function to check each block of the proposed AGFL
      against the rmap btree.  If the rmapbt finds any mappings that are not
      OWN_AG, strike that block from the list.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: standardize GFP flags usage in online scrub · 48ff4045
      Authored by Darrick J. Wong
      Memory allocation usage is the same throughout online fsck -- we want
      kernel memory, we have to be able to back out if we can't allocate
      memory, and we don't want to spray dmesg with memory allocation failure
      reports.  Standardize the GFP flag usage and document these requirements.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
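      A sketch of the standardized flags, assuming the macro name
      XCHK_GFP_FLAGS; the kvzalloc() call is just a usage example:

      /*
       * GFP_KERNEL memory, allow the allocation to fail so scrub can back
       * out gracefully, and don't spray dmesg with failure warnings.
       */
      #define XCHK_GFP_FLAGS  ((__force gfp_t)(GFP_KERNEL | __GFP_NOWARN | \
                                               __GFP_RETRY_MAYFAIL))

      buf = kvzalloc(sizeof(*buf), XCHK_GFP_FLAGS);
      if (!buf)
              return -ENOMEM;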
    • xfs: log the AGI/AGF buffers when rolling transactions during an AG repair · 3e59c010
      Authored by Darrick J. Wong
      Currently, the only way to lock an allocation group is to hold the AGI
      and AGF buffers.  If a repair needs to roll the transaction while
      repairing some AG metadata, it maintains that lock by holding the two
      buffers across the transaction roll and joins them afterwards.
      
      However, repair is not like other parts of XFS that employ the bhold -
      roll - bjoin sequence because it's possible that the AGI or AGF buffers
      are not actually dirty before the roll.  This presents two problems --
      First, we need to redirty those buffers to keep them moving along in the
      log to avoid pinning the log tail.  Second, a clean buffer log item can
      detach from the buffer.  If this happens, the buffer type state is
      discarded along with the bli and must be reattached before the next time
      the buffer is logged.   If it is not, the logging code will complain and
      log recovery will not work properly.
      
      An earlier version of this patch tried to fix the second problem by
      re-setting the buffer type in the bli after joining the buffer to the
      new transaction, but that looked weird and didn't solve the first
      problem.  Instead, solve both problems by logging the buffer before
      rolling the transaction.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
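      A sketch of the bhold/roll/bjoin sequence with the extra logging step
      described above; the buffer ranges and variable names are illustrative:

      /* Dirty the AG headers so their buffer log items stay attached ... */
      xfs_trans_log_buf(sc->tp, agi_bp, 0, sizeof(struct xfs_agi) - 1);
      xfs_trans_log_buf(sc->tp, agf_bp, 0, sizeof(struct xfs_agf) - 1);

      /* ... then hold them across the roll and rejoin them afterwards. */
      xfs_trans_bhold(sc->tp, agi_bp);
      xfs_trans_bhold(sc->tp, agf_bp);

      error = xfs_trans_roll(&sc->tp);
      if (error)
              return error;

      xfs_trans_bjoin(sc->tp, agi_bp);
      xfs_trans_bjoin(sc->tp, agf_bp);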
    • xfs: don't track the AGFL buffer in the scrub AG context · be1317fd
      Authored by Darrick J. Wong
      While scrubbing an allocation group, we don't need to hold the AGFL
      buffer as part of the scrub context.  All that is necessary to lock an
      AG is to hold the AGI and AGF buffers, so fix all the existing users of
      the AGFL buffer to grab them only when necessary.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: fully initialize xfs_da_args in xchk_directory_blocks · 9a48b4a6
      Authored by Darrick J. Wong
      While running the online fsck test suite, I noticed the following
      assertion in the kernel log (edited for brevity):
      
      XFS: Assertion failed: 0, file: fs/xfs/xfs_health.c, line: 571
      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 11667 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
      CPU: 3 PID: 11667 Comm: xfs_scrub Tainted: G        W         5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
      RIP: 0010:assfail+0x46/0x4a [xfs]
      Call Trace:
       <TASK>
       xfs_dir2_isblock+0xcc/0xe0
       xchk_directory_blocks+0xc7/0x420
       xchk_directory+0x53/0xb0
       xfs_scrub_metadata+0x2b6/0x6b0
       xfs_scrubv_metadata+0x35e/0x4d0
       xfs_ioc_scrubv_metadata+0x111/0x160
       xfs_file_ioctl+0x4ec/0xef0
       __x64_sys_ioctl+0x82/0xa0
       do_syscall_64+0x2b/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This assertion triggers in xfs_dirattr_mark_sick when the caller passes
      in a whichfork value that is neither of XFS_{DATA,ATTR}_FORK.  The cause
      of this is that xchk_directory_blocks only partially initializes the
      xfs_da_args structure that is passed to xfs_dir2_isblock.  If the data
      fork is not correct, the XFS_IS_CORRUPT clause will trigger.  My
      development branch reports this failure to the health monitoring
      subsystem, which accesses the uninitialized args->whichfork field,
      leading to the assertion tripping.  We really shouldn't be passing
      random stack contents around, so the solution here is to force the
      compiler to zero-initialize the struct.
      
      Found by fuzzing u3.bmx[0].blockcount = middlebit on xfs/1554.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
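      A sketch of the zero-initialization fix, using a designated initializer
      so every unnamed field of the on-stack structure starts at zero; the
      particular fields shown are illustrative:

      struct xfs_da_args args = {
              .dp             = sc->ip,
              .geo            = sc->mp->m_dir_geo,
              .whichfork      = XFS_DATA_FORK,
              .trans          = sc->tp,
      };
      int is_block;
      int error;

      /* args.whichfork is now always valid rather than stack garbage. */
      error = xfs_dir2_isblock(&args, &is_block);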
  5. 07 November 2022, 1 commit
    • xfs: write page faults in iomap are not buffered writes · 118e021b
      Authored by Dave Chinner
      When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
      we mark the iomap as IOMAP_F_NEW so that the write context
      understands that it allocated the delalloc region.
      
      If we then fail that buffered write, xfs_buffered_write_iomap_end()
      checks for the IOMAP_F_NEW flag and if it is set, it punches out
      the unused delalloc region that was allocated for the write.
      
      The assumption this code makes is that all buffered write operations
      that can allocate space are run under an exclusive lock (i_rwsem).
      This is an invalid assumption: page faults in mmap()d regions call
      through this same function pair to map the file range being faulted
      and this runs only holding the inode->i_mapping->invalidate_lock in
      shared mode.
      
      IOWs, we can have races between page faults and write() calls that
      fail the nested page cache write operation that result in data loss.
      That is, the failing iomap_end call will punch out the data that
      the other racing iomap iteration brought into the page cache. This
      can be reproduced with generic/34[46] if we arbitrarily fail page
      cache copy-in operations from write() syscalls.
      
      Code analysis tells us that the iomap_page_mkwrite() function holds
      the already instantiated and uptodate folio locked across the iomap
      mapping iterations. Hence the folio cannot be removed from memory
      whilst we are mapping the range it covers, and as such we do not
      care if the mapping changes state underneath the iomap iteration
      loop:
      
      1. if the folio is not already dirty, there are no writeback races
         possible.
      2. if we allocated the mapping (delalloc or unwritten), the folio
         cannot already be dirty. See #1.
      3. If the folio is already dirty, it must be up to date. As we hold
         it locked, it cannot be reclaimed from memory. Hence we always
         have valid data in the page cache while iterating the mapping.
      4. Valid data in the page cache can exist when the underlying
         mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
         change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
         change the data in the page - it only affects actions if we are
         initialising a new page. Hence #3 applies  and we don't care
         about these extent map transitions racing with
         iomap_page_mkwrite().
      5. iomap_page_mkwrite() checks for page invalidation races
         (truncate, hole punch, etc) after it locks the folio. We also
         hold the mapping->invalidate_lock here, and hence the mapping
         cannot change due to extent removal operations while we are
         iterating the folio.
      
      As such, filesystems that don't use bufferheads will never fail
      the iomap_folio_mkwrite_iter() operation on the current mapping,
      regardless of whether the iomap should be considered stale.
      
      Further, the range we are asked to iterate is limited to the range
      inside EOF that the folio spans. Hence, for XFS, we will only map
      the exact range we are asked for, and we will only do speculative
      preallocation with delalloc if we are mapping a hole at the EOF
      page. The iterator will consume the entire range of the folio that
      is within EOF, and anything beyond the EOF block cannot be accessed.
      We never need to truncate this post-EOF speculative prealloc away in
      the context of the iomap_page_mkwrite() iterator because if it
      remains unused we'll remove it when the last reference to the inode
      goes away.
      
      Hence we don't actually need an .iomap_end() cleanup/error handling
      path at all for iomap_page_mkwrite() for XFS. This means we can
      separate the page fault processing from the complexity of the
      .iomap_end() processing in the buffered write path. This also means
      that the buffered write path will also be able to take the
      mapping->invalidate_lock as necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
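      A sketch of the resulting split, assuming the fault path gets its own
      iomap_ops without an ->iomap_end method (the ops and wrapper names are
      illustrative, not authoritative):

      static const struct iomap_ops xfs_page_mkwrite_iomap_ops = {
              .iomap_begin    = xfs_buffered_write_iomap_begin,
              /* no ->iomap_end: page faults never run the write-failure cleanup */
      };

      static vm_fault_t
      xfs_write_fault(struct vm_fault *vmf)
      {
              return iomap_page_mkwrite(vmf, &xfs_page_mkwrite_iomap_ops);
      }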
  6. 31 October 2022, 6 commits