1. 29 11月, 2022 5 次提交
    • D
      xfs: drop write error injection is unfixable, remove it · 6e8af15c
      Dave Chinner 提交于
      With the changes to scan the page cache for dirty data to avoid data
      corruptions from partial write cleanup racing with other page cache
      operations, the drop writes error injection no longer works the same
      way it used to and causes xfs/196 to fail. This is because xfs/196
      writes to the file and populates the page cache before it turns on
      the error injection and starts failing -overwrites-.
      
      The result is that the original drop-writes code failed writes only
      -after- overwriting the data in the cache, followed by invalidates
      the cached data, then punching out the delalloc extent from under
      that data.
      
      On the surface, this looks fine. The problem is that page cache
      invalidation *doesn't guarantee that it removes anything from the
      page cache* and it doesn't change the dirty state of the folio. When
      block size == page size and we do page aligned IO (as xfs/196 does)
      everything happens to align perfectly and page cache invalidation
      removes the single page folios that span the written data. Hence the
      followup delalloc punch pass does not find cached data over that
      range and it can punch the extent out.
      
      IOWs, xfs/196 "works" for block size == page size with the new
      code. I say "works", because it actually only works for the case
      where IO is page aligned, and no data was read from disk before
      writes occur. Because the moment we actually read data first, the
      readahead code allocates multipage folios and suddenly the
      invalidate code goes back to zeroing subfolio ranges without
      changing dirty state.
      
      Hence, with multipage folios in play, block size == page size is
      functionally identical to block size < page size behaviour, and
      drop-writes is manifestly broken w.r.t to this case. Invalidation of
      a subfolio range doesn't result in the folio being removed from the
      cache, just the range gets zeroed. Hence after we've sequentially
      walked over a folio that we've dirtied (via write data) and then
      invalidated, we end up with a dirty folio full of zeroed data.
      
      And because the new code skips punching ranges that have dirty
      folios covering them, we end up leaving the delalloc range intact
      after failing all the writes. Hence failed writes now end up
      writing zeroes to disk in the cases where invalidation zeroes folios
      rather than removing them from cache.
      
      This is a fundamental change of behaviour that is needed to avoid
      the data corruption vectors that exist in the old write fail path,
      and it renders the drop-writes injection non-functional and
      unworkable as it stands.
      
      As it is, I think the error injection is also now unnecessary, as
      partial writes that need delalloc extent are going to be a lot more
      common with stale iomap detection in place. Hence this patch removes
      the drop-writes error injection completely. xfs/196 can remain for
      testing kernels that don't have this data corruption fix, but those
      that do will report:
      
      xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      6e8af15c
    • D
      xfs: use iomap_valid method to detect stale cached iomaps · 304a68b9
      Dave Chinner 提交于
      Now that iomap supports a mechanism to validate cached iomaps for
      buffered write operations, hook it up to the XFS buffered write ops
      so that we can avoid data corruptions that result from stale cached
      iomaps. See:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      or the ->iomap_valid() introduction commit for exact details of the
      corruption vector.
      
      The validity cookie we store in the iomap is based on the type of
      iomap we return. It is expected that the iomap->flags we set in
      xfs_bmbt_to_iomap() is not perturbed by the iomap core and are
      returned to us in the iomap passed via the .iomap_valid() callback.
      This ensures that the validity cookie is always checking the correct
      inode fork sequence numbers to detect potential changes that affect
      the extent cached by the iomap.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      304a68b9
    • D
      iomap: write iomap validity checks · d7b64041
      Dave Chinner 提交于
      A recent multithreaded write data corruption has been uncovered in
      the iomap write code. The core of the problem is partial folio
      writes can be flushed to disk while a new racing write can map it
      and fill the rest of the page:
      
      writeback			new write
      
      allocate blocks
        blocks are unwritten
      submit IO
      .....
      				map blocks
      				iomap indicates UNWRITTEN range
      				loop {
      				  lock folio
      				  copyin data
      .....
      IO completes
        runs unwritten extent conv
          blocks are marked written
      				  <iomap now stale>
      				  get next folio
      				}
      
      Now add memory pressure such that memory reclaim evicts the
      partially written folio that has already been written to disk.
      
      When the new write finally gets to the last partial page of the new
      write, it does not find it in cache, so it instantiates a new page,
      sees the iomap is unwritten, and zeros the part of the page that
      it does not have data from. This overwrites the data on disk that
      was originally written.
      
      The full description of the corruption mechanism can be found here:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      To solve this problem, we need to check whether the iomap is still
      valid after we lock each folio during the write. We have to do it
      after we lock the page so that we don't end up with state changes
      occurring while we wait for the folio to be locked.
      
      Hence we need a mechanism to be able to check that the cached iomap
      is still valid (similar to what we already do in buffered
      writeback), and we need a way for ->begin_write to back out and
      tell the high level iomap iterator that we need to remap the
      remaining write range.
      
      The iomap needs to grow some storage for the validity cookie that
      the filesystem provides to travel with the iomap. XFS, in
      particular, also needs to know some more information about what the
      iomap maps (attribute extents rather than file data extents) to for
      the validity cookie to cover all the types of iomaps we might need
      to validate.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      d7b64041
    • D
      xfs: xfs_bmap_punch_delalloc_range() should take a byte range · 7348b322
      Dave Chinner 提交于
      All the callers of xfs_bmap_punch_delalloc_range() jump through
      hoops to convert a byte range to filesystem blocks before calling
      xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
      xfs_bmap_punch_delalloc_range() and have it do the conversion to
      filesystem blocks internally.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      7348b322
    • D
      iomap: buffered write failure should not truncate the page cache · f43dc4dc
      Dave Chinner 提交于
      iomap_file_buffered_write_punch_delalloc() currently invalidates the
      page cache over the unused range of the delalloc extent that was
      allocated. While the write allocated the delalloc extent, it does
      not own it exclusively as the write does not hold any locks that
      prevent either writeback or mmap page faults from changing the state
      of either the page cache or the extent state backing this range.
      
      Whilst xfs_bmap_punch_delalloc_range() already handles races in
      extent conversion - it will only punch out delalloc extents and it
      ignores any other type of extent - the page cache truncate does not
      discriminate between data written by this write or some other task.
      As a result, truncating the page cache can result in data corruption
      if the write races with mmap modifications to the file over the same
      range.
      
      generic/346 exercises this workload, and if we randomly fail writes
      (as will happen when iomap gets stale iomap detection later in the
      patchset), it will randomly corrupt the file data because it removes
      data written by mmap() in the same page as the write() that failed.
      
      Hence we do not want to punch out the page cache over the range of
      the extent we failed to write to - what we actually need to do is
      detect the ranges that have dirty data in cache over them and *not
      punch them out*.
      
      To do this, we have to walk the page cache over the range of the
      delalloc extent we want to remove. This is made complex by the fact
      we have to handle partially up-to-date folios correctly and this can
      happen even when the FSB size == PAGE_SIZE because we now support
      multi-page folios in the page cache.
      
      Because we are only interested in discovering the edges of data
      ranges in the page cache (i.e. hole-data boundaries) we can make use
      of mapping_seek_hole_data() to find those transitions in the page
      cache. As we hold the invalidate_lock, we know that the boundaries
      are not going to change while we walk the range. This interface is
      also byte-based and is sub-page block aware, so we can find the data
      ranges in the cache based on byte offsets rather than page, folio or
      fs block sized chunks. This greatly simplifies the logic of finding
      dirty cached ranges in the page cache.
      
      Once we've identified a range that contains cached data, we can then
      iterate the range folio by folio. This allows us to determine if the
      data is dirty and hence perform the correct delalloc extent punching
      operations. The seek interface we use to iterate data ranges will
      give us sub-folio start/end granularity, so we may end up looking up
      the same folio multiple times as the seek interface iterates across
      each discontiguous data region in the folio.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      f43dc4dc
  2. 23 11月, 2022 3 次提交
    • D
      xfs,iomap: move delalloc punching to iomap · 9c7babf9
      Dave Chinner 提交于
      Because that's what Christoph wants for this error handling path
      only XFS uses.
      
      It requires a new iomap export for handling errors over delalloc
      ranges. This is basically the XFS code as is stands, but even though
      Christoph wants this as iomap funcitonality, we still have 
      to call it from the filesystem specific ->iomap_end callback, and
      call into the iomap code with yet another filesystem specific
      callback to punch the delalloc extent within the defined ranges.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      9c7babf9
    • D
      xfs: use byte ranges for write cleanup ranges · b71f889c
      Dave Chinner 提交于
      xfs_buffered_write_iomap_end() currently converts the byte ranges
      passed to it to filesystem blocks to pass them to the bmap code to
      punch out delalloc blocks, but then has to convert filesytem
      blocks back to byte ranges for page cache truncate.
      
      We're about to make the page cache truncate go away and replace it
      with a page cache walk, so having to convert everything to/from/to
      filesystem blocks is messy and error-prone. It is much easier to
      pass around byte ranges and convert to page indexes and/or
      filesystem blocks only where those units are needed.
      
      In preparation for the page cache walk being added, add a helper
      that converts byte ranges to filesystem blocks and calls
      xfs_bmap_punch_delalloc_range() and convert
      xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      b71f889c
    • D
      xfs: punching delalloc extents on write failure is racy · 198dd8ae
      Dave Chinner 提交于
      xfs_buffered_write_iomap_end() has a comment about the safety of
      punching delalloc extents based holding the IOLOCK_EXCL. This
      comment is wrong, and punching delalloc extents is not race free.
      
      When we punch out a delalloc extent after a write failure in
      xfs_buffered_write_iomap_end(), we punch out the page cache with
      truncate_pagecache_range() before we punch out the delalloc extents.
      At this point, we only hold the IOLOCK_EXCL, so there is nothing
      stopping mmap() write faults racing with this cleanup operation,
      reinstantiating a folio over the range we are about to punch and
      hence requiring the delalloc extent to be kept.
      
      If this race condition is hit, we can end up with a dirty page in
      the page cache that has no delalloc extent or space reservation
      backing it. This leads to bad things happening at writeback time.
      
      To avoid this race condition, we need the page cache truncation to
      be atomic w.r.t. the extent manipulation. We can do this by holding
      the mapping->invalidate_lock exclusively across this operation -
      this will prevent new pages from being inserted into the page cache
      whilst we are removing the pages and the backing extent and space
      reservation.
      
      Taking the mapping->invalidate_lock exclusively in the buffered
      write IO path is safe - it naturally nests inside the IOLOCK (see
      truncate and fallocate paths). iomap_zero_range() can be called from
      under the mapping->invalidate_lock (from the truncate path via
      either xfs_zero_eof() or xfs_truncate_page(), but iomap_zero_iter()
      will not instantiate new delalloc pages (because it skips holes) and
      hence will not ever need to punch out delalloc extents on failure.
      
      Fix the locking issue, and clean up the code logic a little to avoid
      unnecessary work if we didn't allocate the delalloc extent or wrote
      the entire region we allocated.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      198dd8ae
  3. 22 11月, 2022 1 次提交
    • L
      xfs: fix incorrect i_nlink caused by inode racing · 28b4b059
      Long Li 提交于
      The following error occurred during the fsstress test:
      
      XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
      
      The problem was that inode race condition causes incorrect i_nlink to be
      written to disk, and then it is read into memory. Consider the following
      call graph, inodes that are marked as both XFS_IFLUSHING and
      XFS_IRECLAIMABLE, i_nlink will be reset to 1 and then restored to original
      value in xfs_reinit_inode(). Therefore, the i_nlink of directory on disk
      may be set to 1.
      
        xfsaild
            xfs_inode_item_push
                xfs_iflush_cluster
                    xfs_iflush
                        xfs_inode_to_disk
      
        xfs_iget
            xfs_iget_cache_hit
                xfs_iget_recycle
                    xfs_reinit_inode
                        inode_init_always
      
      xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
      inode state and can race with other RCU protected inode lookups. On the
      read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
      ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
      racing inode updates (during transactions) by that lock.
      
      Fixes: ff7bebeb ("xfs: refactor the inode recycling code") # goes further back than this
      Signed-off-by: NLong Li <leo.lilong@huawei.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      28b4b059
  4. 17 11月, 2022 26 次提交
    • L
      xfs: Print XFS UUID on mount and umount events. · 64c80dfd
      Lukas Herbolt 提交于
      As of now only device names are printed out over __xfs_printk().
      The device names are not persistent across reboots which in case
      of searching for origin of corruption brings another task to properly
      identify the devices. This patch add XFS UUID upon every mount/umount
      event which will make the identification much easier.
      Signed-off-by: NLukas Herbolt <lukas@herbolt.com>
      [sandeen: rebase onto current upstream kernel]
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      64c80dfd
    • L
      xfs: fix sb write verify for lazysbcount · 59f6ab40
      Long Li 提交于
      When lazysbcount is enabled, fsstress and loop mount/unmount test report
      the following problems:
      
      XFS (loop0): SB summary counter sanity check failed
      XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460,
      	xfs_sb block 0x0
      XFS (loop0): Unmount and run xfs_repair
      XFS (loop0): First 128 bytes of corrupted metadata buffer:
      00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
      00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
      00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
      00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
      00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
      00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
      00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
      00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
      XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
      	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
      XFS (loop0): Please unmount the filesystem and rectify the problem(s)
      XFS (loop0): log mount/recovery failed: error -117
      XFS (loop0): log mount failed
      
      This corruption will shutdown the file system and the file system will
      no longer be mountable. The following script can reproduce the problem,
      but it may take a long time.
      
       #!/bin/bash
      
       device=/dev/sda
       testdir=/mnt/test
       round=0
      
       function fail()
       {
      	 echo "$*"
      	 exit 1
       }
      
       mkdir -p $testdir
       while [ $round -lt 10000 ]
       do
      	 echo "******* round $round ********"
      	 mkfs.xfs -f $device
      	 mount $device $testdir || fail "mount failed!"
      	 fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
      	 sleep 4
      	 killall -w fsstress
      	 umount $testdir
      	 xfs_repair -e $device > /dev/null
      	 if [ $? -eq 2 ];then
      		 echo "ERR CODE 2: Dirty log exception during repair."
      		 exit 1
      	 fi
      	 round=$(($round+1))
       done
      
      With lazysbcount is enabled, There is no additional lock protection for
      reading m_ifree and m_icount in xfs_log_sb(), if other cpu modifies the
      m_ifree, this will make the m_ifree greater than m_icount. For example,
      consider the following sequence and ifreedelta is postive:
      
       CPU0				 CPU1
       xfs_log_sb			 xfs_trans_unreserve_and_mod_sb
       ----------			 ------------------------------
       percpu_counter_sum(&mp->m_icount)
      				 percpu_counter_add_batch(&mp->m_icount,
      						idelta, XFS_ICOUNT_BATCH)
      				 percpu_counter_add(&mp->m_ifree, ifreedelta);
       percpu_counter_sum(&mp->m_ifree)
      
      After this, incorrect inode count (sb_ifree > sb_icount) will be writen to
      the log. In the subsequent writing of sb, incorrect inode count (sb_ifree >
      sb_icount) will fail to pass the boundary check in xfs_validate_sb_write()
      that cause the file system shutdown.
      
      When lazysbcount is enabled, we don't need to guarantee that Lazy sb
      counters are completely correct, but we do need to guarantee that sb_ifree
      <= sb_icount. On the other hand, the constraint that m_ifree <= m_icount
      must be satisfied any time that there /cannot/ be other threads allocating
      or freeing inode chunks. If the constraint is violated under these
      circumstances, sb_i{count,free} (the ondisk superblock inode counters)
      maybe incorrect and need to be marked sick at unmount, the count will
      be rebuilt on the next mount.
      
      Fixes: 8756a5af ("libxfs: add more bounds checking to sb sanity checks")
      Signed-off-by: NLong Li <leo.lilong@huawei.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      59f6ab40
    • D
      xfs: fix incorrect error-out in xfs_remove · 2653d533
      Darrick J. Wong 提交于
      Clean up resources if resetting the dotdot entry doesn't succeed.
      Observed through code inspection.
      
      Fixes: 5838d035 ("xfs: reset child dir '..' entry when unlinking child")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAndrey Albershteyn <aalbersh@redhat.com>
      2653d533
    • D
      xfs: check inode core when scrubbing metadata files · f36b954a
      Darrick J. Wong 提交于
      Metadata files (e.g. realtime bitmaps and quota files) do not show up in
      the bulkstat output, which means that scrub-by-handle does not work;
      they can only be checked through a specific scrub type.  Therefore, each
      scrub type calls xchk_metadata_inode_forks to check the metadata for
      whatever's in the file.
      
      Unfortunately, that function doesn't actually check the inode record
      itself.  Refactor the function a bit to make that happen.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      f36b954a
    • D
      xfs: don't warn about files that are exactly s_maxbytes long · bd5ab5f9
      Darrick J. Wong 提交于
      We can handle files that are exactly s_maxbytes bytes long; we just
      can't handle anything larger than that.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      bd5ab5f9
    • D
      xfs: teach scrub to flag non-extents format cow forks · 5eef4635
      Darrick J. Wong 提交于
      CoW forks only exist in memory, which means that they can only ever have
      an incore extent tree.  Hence they must always be FMT_EXTENTS, so check
      this when we're scrubbing them.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      5eef4635
    • D
      xfs: check that CoW fork extents are not shared · 31785537
      Darrick J. Wong 提交于
      Ensure that extents in an inode's CoW fork are not marked as shared in
      the refcount btree.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      31785537
    • D
      xfs: check quota files for unwritten extents · f23c4044
      Darrick J. Wong 提交于
      Teach scrub to flag quota files containing unwritten extents.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      f23c4044
    • D
      xfs: block map scrub should handle incore delalloc reservations · 830ffa09
      Darrick J. Wong 提交于
      Enhance the block map scrubber to check delayed allocation reservations.
      Though there are no physical space allocations to check, we do need to
      make sure that the range of file offsets being mapped are correct, and
      to bump the lastoff cursor so that key order checking works correctly.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      830ffa09
    • D
      xfs: teach scrub to check for adjacent bmaps when rmap larger than bmap · 6a577786
      Darrick J. Wong 提交于
      When scrub is checking file fork mappings against rmap records and
      the rmap record starts before or ends after the bmap record, check the
      adjacent bmap records to make sure that they're adjacent to the one
      we're checking.  This helps us to detect cases where the rmaps cover
      territory that the bmaps do not.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      6a577786
    • D
      xfs: fix perag loop in xchk_bmap_check_rmaps · 033985b6
      Darrick J. Wong 提交于
      sparse complains that we can return an uninitialized error from this
      function and that pag could be uninitialized.  We know that there are no
      zero-AG filesystems and hence we had to call xchk_bmap_check_ag_rmaps at
      least once, so this is not actually possible, but I'm too worn out from
      automated complaints from unsophisticated AIs so let's just fix this and
      move on to more interesting problems, eh?
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      033985b6
    • D
      xfs: online checking of the free rt extent count · e74331d6
      Darrick J. Wong 提交于
      Teach the summary count checker to count the number of free realtime
      extents and compare that to the superblock copy.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      e74331d6
    • D
      xfs: make rtbitmap ILOCKing consistent when scanning the rt bitmap file · 5f369dc5
      Darrick J. Wong 提交于
      xfs_rtalloc_query_range scans the realtime bitmap file in order of
      increasing file offset, so this caller can take ILOCK_SHARED on the rt
      bitmap inode instead of ILOCK_EXCL.  This isn't going to yield any
      practical benefits at mount time, but we'd like to make the locking
      usage consistent around xfs_rtalloc_query_all calls.  Make all the
      places we do this use the same xfs_ilock lockflags for consistency.
      
      Fixes: 4c934c7d ("xfs: report realtime space information via the rtbitmap")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      5f369dc5
    • D
      xfs: don't return -EFSCORRUPTED from repair when resources cannot be grabbed · 93b0c58e
      Darrick J. Wong 提交于
      If we tried to repair something but the repair failed with -EDEADLOCK,
      that means that the repair function couldn't grab some resource it
      needed and wants us to try again.  If we try again (with TRY_HARDER) but
      still can't get all the resources we need, the repair fails and errors
      remain on the filesystem.
      
      Right now, repair returns the -EDEADLOCK to the caller as -EFSCORRUPTED,
      which results in XFS_SCRUB_OFLAG_CORRUPT being passed out to userspace.
      This is not correct because repair has not determined that anything is
      corrupt.  If the repair had been invoked on an object that could be
      optimized but wasn't corrupt (OFLAG_PREEN), the inability to grab
      resources will be reported to userspace as corrupt metadata, and users
      will be unnecessarily alarmed that their suboptimal metadata turned into
      a corruption.
      
      Fix this by returning zero so that the results of the actual scrub will
      be copied back out to userspace.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      93b0c58e
    • D
      xfs: skip fscounters comparisons when the scan is incomplete · 11f97e68
      Darrick J. Wong 提交于
      If any part of the per-AG summary counter scan loop aborts without
      collecting all of the data we need, the scrubber's observation data will
      be invalid.  Set the incomplete flag so that we abort the scrub without
      reporting false corruptions.  Document the data dependency here too.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      11f97e68
    • D
      xfs: load rtbitmap and rtsummary extent mapping btrees at mount time · 9e13975b
      Darrick J. Wong 提交于
      It turns out that GETFSMAP and online fsck have had a bug for years due
      to their use of ILOCK_SHARED to coordinate their linear scans of the
      realtime bitmap.  If the bitmap file's data fork happens to be in BTREE
      format and the scan occurs immediately after mounting, the incore bmbt
      will not be populated, leading to ASSERTs tripping over the incorrect
      inode state.  Because the bitmap scans always lock bitmap buffers in
      increasing order of file offset, it is appropriate for these two callers
      to take a shared ILOCK to improve scalability.
      
      To fix this problem, load both data and attr fork state into memory when
      mounting the realtime inodes.  Realtime metadata files aren't supposed
      to have an attr fork so the second step is likely a nop.
      
      On most filesystems this is unlikely since the rtbitmap data fork is
      usually in extents format, but it's possible to craft a filesystem that
      will by fragmenting the free space in the data section and growfsing the
      rt section.
      
      Fixes: 4c934c7d ("xfs: report realtime space information via the rtbitmap")
      Also-Fixes: 46d9bfb5 ("xfs: cross-reference the realtime bitmap")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      9e13975b
    • D
      xfs: pivot online scrub away from kmem.[ch] · 306195f3
      Darrick J. Wong 提交于
      Convert all the online scrub code to use the Linux slab allocator
      functions directly instead of going through the kmem wrappers.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      306195f3
    • D
      xfs: don't retry repairs harder when EAGAIN is returned · 6bf2f879
      Darrick J. Wong 提交于
      Repair functions will not return EAGAIN -- if they were not able to
      obtain resources, they should return EDEADLOCK (like the rest of online
      fsck) to signal that we need to grab all the resources and try again.
      Hence we don't need to deal with this case except as a debugging
      assertion.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      6bf2f879
    • D
      xfs: initialize the check_owner object fully · fcd2a434
      Darrick J. Wong 提交于
      Initialize the check_owner list head so that we don't corrupt the list.
      Reduce the scope of the object pointer.
      
      Fixes: 858333dc ("xfs: check btree block ownership with bnobt/rmapbt when scrubbing btree")
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      fcd2a434
    • D
      xfs: fix return code when fatal signal encountered during dquot scrub · 0a713bd4
      Darrick J. Wong 提交于
      If the scrub process is sent a fatal signal while we're checking dquots,
      the predicate for this will set the error code to -EINTR.  Don't then
      squash that into -ECANCELED, because the wrong errno turns up in the
      trace output.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      0a713bd4
    • D
      xfs: return EINTR when a fatal signal terminates scrub · a7a0f9a5
      Darrick J. Wong 提交于
      If the program calling online fsck is terminated with a fatal signal,
      bail out to userspace by returning EINTR, not EAGAIN.  EAGAIN is used by
      scrubbers to indicate that we should try again with more resources
      locked, and not to indicate that the operation was cancelled.  The
      miswiring is mostly harmless, but it shows up in the trace data.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      a7a0f9a5
    • D
      xfs: make AGFL repair function avoid crosslinked blocks · b255fab0
      Darrick J. Wong 提交于
      Teach the AGFL repair function to check each block of the proposed AGFL
      against the rmap btree.  If the rmapbt finds any mappings that are not
      OWN_AG, strike that block from the list.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      b255fab0
    • D
      xfs: standardize GFP flags usage in online scrub · 48ff4045
      Darrick J. Wong 提交于
      Memory allocation usage is the same throughout online fsck -- we want
      kernel memory, we have to be able to back out if we can't allocate
      memory, and we don't want to spray dmesg with memory allocation failure
      reports.  Standardize the GFP flag usage and document these requirements.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      48ff4045
    • D
      xfs: log the AGI/AGF buffers when rolling transactions during an AG repair · 3e59c010
      Darrick J. Wong 提交于
      Currently, the only way to lock an allocation group is to hold the AGI
      and AGF buffers.  If a repair needs to roll the transaction while
      repairing some AG metadata, it maintains that lock by holding the two
      buffers across the transaction roll and joins them afterwards.
      
      However, repair is not like other parts of XFS that employ the bhold -
      roll - bjoin sequence because it's possible that the AGI or AGF buffers
      are not actually dirty before the roll.  This presents two problems --
      First, we need to redirty those buffers to keep them moving along in the
      log to avoid pinning the log tail.  Second, a clean buffer log item can
      detach from the buffer.  If this happens, the buffer type state is
      discarded along with the bli and must be reattached before the next time
      the buffer is logged.   If it is not, the logging code will complain and
      log recovery will not work properly.
      
      An earlier version of this patch tried to fix the second problem by
      re-setting the buffer type in the bli after joining the buffer to the
      new transaction, but that looked weird and didn't solve the first
      problem.  Instead, solve both problems by logging the buffer before
      rolling the transaction.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      3e59c010
    • D
      xfs: don't track the AGFL buffer in the scrub AG context · be1317fd
      Darrick J. Wong 提交于
      While scrubbing an allocation group, we don't need to hold the AGFL
      buffer as part of the scrub context.  All that is necessary to lock an
      AG is to hold the AGI and AGF buffers, so fix all the existing users of
      the AGFL buffer to grab them only when necessary.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      be1317fd
    • D
      xfs: fully initialize xfs_da_args in xchk_directory_blocks · 9a48b4a6
      Darrick J. Wong 提交于
      While running the online fsck test suite, I noticed the following
      assertion in the kernel log (edited for brevity):
      
      XFS: Assertion failed: 0, file: fs/xfs/xfs_health.c, line: 571
      ------------[ cut here ]------------
      WARNING: CPU: 3 PID: 11667 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
      CPU: 3 PID: 11667 Comm: xfs_scrub Tainted: G        W         5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
      RIP: 0010:assfail+0x46/0x4a [xfs]
      Call Trace:
       <TASK>
       xfs_dir2_isblock+0xcc/0xe0
       xchk_directory_blocks+0xc7/0x420
       xchk_directory+0x53/0xb0
       xfs_scrub_metadata+0x2b6/0x6b0
       xfs_scrubv_metadata+0x35e/0x4d0
       xfs_ioc_scrubv_metadata+0x111/0x160
       xfs_file_ioctl+0x4ec/0xef0
       __x64_sys_ioctl+0x82/0xa0
       do_syscall_64+0x2b/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This assertion triggers in xfs_dirattr_mark_sick when the caller passes
      in a whichfork value that is neither of XFS_{DATA,ATTR}_FORK.  The cause
      of this is that xchk_directory_blocks only partially initializes the
      xfs_da_args structure that is passed to xfs_dir2_isblock.  If the data
      fork is not correct, the XFS_IS_CORRUPT clause will trigger.  My
      development branch reports this failure to the health monitoring
      subsystem, which accesses the uninitialized args->whichfork field,
      leading the the assertion tripping.  We really shouldn't be passing
      random stack contents around, so the solution here is to force the
      compiler to zero-initialize the struct.
      
      Found by fuzzing u3.bmx[0].blockcount = middlebit on xfs/1554.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      9a48b4a6
  5. 07 11月, 2022 1 次提交
    • D
      xfs: write page faults in iomap are not buffered writes · 118e021b
      Dave Chinner 提交于
      When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
      we mark the iomap as IOMAP_F_NEW so that the the write context
      understands that it allocated the delalloc region.
      
      If we then fail that buffered write, xfs_buffered_write_iomap_end()
      checks for the IOMAP_F_NEW flag and if it is set, it punches out
      the unused delalloc region that was allocated for the write.
      
      The assumption this code makes is that all buffered write operations
      that can allocate space are run under an exclusive lock (i_rwsem).
      This is an invalid assumption: page faults in mmap()d regions call
      through this same function pair to map the file range being faulted
      and this runs only holding the inode->i_mapping->invalidate_lock in
      shared mode.
      
      IOWs, we can have races between page faults and write() calls that
      fail the nested page cache write operation that result in data loss.
      That is, the failing iomap_end call will punch out the data that
      the other racing iomap iteration brought into the page cache. This
      can be reproduced with generic/34[46] if we arbitrarily fail page
      cache copy-in operations from write() syscalls.
      
      Code analysis tells us that the iomap_page_mkwrite() function holds
      the already instantiated and uptodate folio locked across the iomap
      mapping iterations. Hence the folio cannot be removed from memory
      whilst we are mapping the range it covers, and as such we do not
      care if the mapping changes state underneath the iomap iteration
      loop:
      
      1. if the folio is not already dirty, there is no writeback races
         possible.
      2. if we allocated the mapping (delalloc or unwritten), the folio
         cannot already be dirty. See #1.
      3. If the folio is already dirty, it must be up to date. As we hold
         it locked, it cannot be reclaimed from memory. Hence we always
         have valid data in the page cache while iterating the mapping.
      4. Valid data in the page cache can exist when the underlying
         mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
         change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
         change the data in the page - it only affects actions if we are
         initialising a new page. Hence #3 applies  and we don't care
         about these extent map transitions racing with
         iomap_page_mkwrite().
      5. iomap_page_mkwrite() checks for page invalidation races
         (truncate, hole punch, etc) after it locks the folio. We also
         hold the mapping->invalidation_lock here, and hence the mapping
         cannot change due to extent removal operations while we are
         iterating the folio.
      
      As such, filesystems that don't use bufferheads will never fail
      the iomap_folio_mkwrite_iter() operation on the current mapping,
      regardless of whether the iomap should be considered stale.
      
      Further, the range we are asked to iterate is limited to the range
      inside EOF that the folio spans. Hence, for XFS, we will only map
      the exact range we are asked for, and we will only do speculative
      preallocation with delalloc if we are mapping a hole at the EOF
      page. The iterator will consume the entire range of the folio that
      is within EOF, and anything beyond the EOF block cannot be accessed.
      We never need to truncate this post-EOF speculative prealloc away in
      the context of the iomap_page_mkwrite() iterator because if it
      remains unused we'll remove it when the last reference to the inode
      goes away.
      
      Hence we don't actually need an .iomap_end() cleanup/error handling
      path at all for iomap_page_mkwrite() for XFS. This means we can
      separate the page fault processing from the complexity of the
      .iomap_end() processing in the buffered write path. This also means
      that the buffered write path will also be able to take the
      mapping->invalidate_lock as necessary.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      118e021b
  6. 06 11月, 2022 4 次提交
    • T
      ext4: fix fortify warning in fs/ext4/fast_commit.c:1551 · 0d043351
      Theodore Ts'o 提交于
      With the new fortify string system, rework the memcpy to avoid this
      warning:
      
      memcpy: detected field-spanning write (size 60) of single field "&raw_inode->i_generation" at fs/ext4/fast_commit.c:1551 (size 4)
      
      Cc: stable@kernel.org
      Fixes: 54d9469b ("fortify: Add run-time WARN for cross-field memcpy()")
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      0d043351
    • J
      ext4: fix wrong return err in ext4_load_and_init_journal() · 9f2a1d9f
      Jason Yan 提交于
      The return value is wrong in ext4_load_and_init_journal(). The local
      variable 'err' need to be initialized before goto out. The original code
      in __ext4_fill_super() is fine because it has two return values 'ret'
      and 'err' and 'ret' is initialized as -EINVAL. After we factor out
      ext4_load_and_init_journal(), this code is broken. So fix it by directly
      returning -EINVAL in the error handler path.
      
      Cc: stable@kernel.org
      Fixes: 9c1dd22d ("ext4: factor out ext4_load_and_init_journal()")
      Signed-off-by: NJason Yan <yanaijie@huawei.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221025040206.3134773-1-yanaijie@huawei.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      9f2a1d9f
    • Y
      ext4: fix warning in 'ext4_da_release_space' · 1b8f787e
      Ye Bin 提交于
      Syzkaller report issue as follows:
      EXT4-fs (loop0): Free/Dirty block details
      EXT4-fs (loop0): free_blocks=0
      EXT4-fs (loop0): dirty_blocks=0
      EXT4-fs (loop0): Block reservation details
      EXT4-fs (loop0): i_reserved_data_blocks=0
      EXT4-fs warning (device loop0): ext4_da_release_space:1527: ext4_da_release_space: ino 18, to_free 1 with only 0 reserved data blocks
      ------------[ cut here ]------------
      WARNING: CPU: 0 PID: 92 at fs/ext4/inode.c:1528 ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1524
      Modules linked in:
      CPU: 0 PID: 92 Comm: kworker/u4:4 Not tainted 6.0.0-syzkaller-09423-g493ffd66 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/22/2022
      Workqueue: writeback wb_workfn (flush-7:0)
      RIP: 0010:ext4_da_release_space+0x25e/0x370 fs/ext4/inode.c:1528
      RSP: 0018:ffffc900015f6c90 EFLAGS: 00010296
      RAX: 42215896cd52ea00 RBX: 0000000000000000 RCX: 42215896cd52ea00
      RDX: 0000000000000000 RSI: 0000000080000001 RDI: 0000000000000000
      RBP: 1ffff1100e907d96 R08: ffffffff816aa79d R09: fffff520002bece5
      R10: fffff520002bece5 R11: 1ffff920002bece4 R12: ffff888021fd2000
      R13: ffff88807483ecb0 R14: 0000000000000001 R15: ffff88807483e740
      FS:  0000000000000000(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005555569ba628 CR3: 000000000c88e000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       ext4_es_remove_extent+0x1ab/0x260 fs/ext4/extents_status.c:1461
       mpage_release_unused_pages+0x24d/0xef0 fs/ext4/inode.c:1589
       ext4_writepages+0x12eb/0x3be0 fs/ext4/inode.c:2852
       do_writepages+0x3c3/0x680 mm/page-writeback.c:2469
       __writeback_single_inode+0xd1/0x670 fs/fs-writeback.c:1587
       writeback_sb_inodes+0xb3b/0x18f0 fs/fs-writeback.c:1870
       wb_writeback+0x41f/0x7b0 fs/fs-writeback.c:2044
       wb_do_writeback fs/fs-writeback.c:2187 [inline]
       wb_workfn+0x3cb/0xef0 fs/fs-writeback.c:2227
       process_one_work+0x877/0xdb0 kernel/workqueue.c:2289
       worker_thread+0xb14/0x1330 kernel/workqueue.c:2436
       kthread+0x266/0x300 kernel/kthread.c:376
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306
       </TASK>
      
      Above issue may happens as follows:
      ext4_da_write_begin
        ext4_create_inline_data
          ext4_clear_inode_flag(inode, EXT4_INODE_EXTENTS);
          ext4_set_inode_flag(inode, EXT4_INODE_INLINE_DATA);
      __ext4_ioctl
        ext4_ext_migrate -> will lead to eh->eh_entries not zero, and set extent flag
      ext4_da_write_begin
        ext4_da_convert_inline_data_to_extent
          ext4_da_write_inline_data_begin
            ext4_da_map_blocks
              ext4_insert_delayed_block
      	  if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk))
      	    if (!ext4_es_scan_clu(inode, &ext4_es_is_mapped, lblk))
      	      ext4_clu_mapped(inode, EXT4_B2C(sbi, lblk)); -> will return 1
      	       allocated = true;
                ext4_es_insert_delayed_block(inode, lblk, allocated);
      ext4_writepages
        mpage_map_and_submit_extent(handle, &mpd, &give_up_on_write); -> return -ENOSPC
        mpage_release_unused_pages(&mpd, give_up_on_write); -> give_up_on_write == 1
          ext4_es_remove_extent
            ext4_da_release_space(inode, reserved);
              if (unlikely(to_free > ei->i_reserved_data_blocks))
      	  -> to_free == 1  but ei->i_reserved_data_blocks == 0
      	  -> then trigger warning as above
      
      To solve above issue, forbid inode do migrate which has inline data.
      
      Cc: stable@kernel.org
      Reported-by: syzbot+c740bb18df70ad00952e@syzkaller.appspotmail.com
      Signed-off-by: NYe Bin <yebin10@huawei.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221018022701.683489-1-yebin10@huawei.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      1b8f787e
    • L
      ext4: fix BUG_ON() when directory entry has invalid rec_len · 17a0bc9b
      Luís Henriques 提交于
      The rec_len field in the directory entry has to be a multiple of 4.  A
      corrupted filesystem image can be used to hit a BUG() in
      ext4_rec_len_to_disk(), called from make_indexed_dir().
      
       ------------[ cut here ]------------
       kernel BUG at fs/ext4/ext4.h:2413!
       ...
       RIP: 0010:make_indexed_dir+0x53f/0x5f0
       ...
       Call Trace:
        <TASK>
        ? add_dirent_to_buf+0x1b2/0x200
        ext4_add_entry+0x36e/0x480
        ext4_add_nondir+0x2b/0xc0
        ext4_create+0x163/0x200
        path_openat+0x635/0xe90
        do_filp_open+0xb4/0x160
        ? __create_object.isra.0+0x1de/0x3b0
        ? _raw_spin_unlock+0x12/0x30
        do_sys_openat2+0x91/0x150
        __x64_sys_open+0x6c/0xa0
        do_syscall_64+0x3c/0x80
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      The fix simply adds a call to ext4_check_dir_entry() to validate the
      directory entry, returning -EFSCORRUPTED if the entry is invalid.
      
      CC: stable@kernel.org
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=216540Signed-off-by: NLuís Henriques <lhenriques@suse.de>
      Link: https://lore.kernel.org/r/20221012131330.32456-1-lhenriques@suse.deSigned-off-by: NTheodore Ts'o <tytso@mit.edu>
      17a0bc9b