1. 29 November 2022 (6 commits)
    • Merge tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs into xfs-6.2-mergeB · 7dd73802
      Darrick J. Wong committed
      
      xfs, iomap: fix data corruption due to stale cached iomaps
      
      This patch series fixes a data corruption that occurs in a specific
      multi-threaded write workload. The workload combined
      racing unaligned adjacent buffered writes with low memory conditions
      that caused both writeback and memory reclaim to race with the
      writes.
      
      The result of this was random partial blocks containing zeroes
      instead of the correct data.  The underlying problem is that iomap
      caches the write iomap for the duration of the write() operation,
      but it fails to take into account that the extent underlying the
      iomap can change whilst the write is in progress.
      
      The short story is that an iomap can span multiple folios, and so
      under low memory conditions writeback can be cleaning folios that the write()
      overlaps. Whilst the overlapping data is cached in memory, this
      isn't a problem, but because the folios are now clean they can be
      reclaimed. Once reclaimed, the write() does the wrong thing when
      re-instantiating partial folios because the iomap no longer reflects
      the underlying state of the extent. e.g. it thinks the extent is
      unwritten, so it zeroes the partial range, when in fact the
      underlying extent is now written and so it should have read the data
      from disk.  This is how we get random zero ranges in the file
      instead of the correct data.
      
      The gory details of the race condition can be found here:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      Fixing the problem has two aspects. The first aspect of the problem
      is ensuring that iomap can detect a stale cached iomap during a
      write in a race-free manner. We already do this stale iomap
      detection in the writeback path, so we have a mechanism for
      detecting that the iomap backing the data range may have changed
      and needs to be remapped.
      
      In the case of the write() path, we have to ensure that the iomap is
      validated at a point in time when the page cache is stable and
      cannot be reclaimed from under us. We also need to validate the
      extent before we start performing any modifications to the folio
      state or contents. Combine these two requirements together, and the
      only "safe" place to validate the iomap is after we have looked up
      and locked the folio we are going to copy the data into, but before
      we've performed any initialisation operations on that folio.
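
      For illustration, a rough sketch of that ordering in the buffered
      write path (reconstructed from this description; helper and flag
      names such as __filemap_get_folio(), page_ops->iomap_valid and
      IOMAP_F_STALE follow the 6.2-era iomap code but are not quoted
      verbatim from the patches):

          static int iomap_write_begin(struct iomap_iter *iter, loff_t pos,
                  size_t len, struct folio **foliop)
          {
              const struct iomap_page_ops *page_ops = iter->iomap.page_ops;
              struct inode *inode = iter->inode;
              struct folio *folio;

              /* look up and lock the folio we are going to copy into */
              folio = __filemap_get_folio(inode->i_mapping, pos >> PAGE_SHIFT,
                      FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE,
                      mapping_gfp_mask(inode->i_mapping));
              if (!folio)
                  return -ENOMEM;

              /*
               * The folio is locked and the page cache is stable, but no
               * folio state or contents have been touched yet: this is the
               * only safe point to revalidate the cached iomap.
               */
              if (page_ops && page_ops->iomap_valid &&
                  !page_ops->iomap_valid(inode, &iter->iomap)) {
                  iter->iomap.flags |= IOMAP_F_STALE;
                  folio_unlock(folio);
                  folio_put(folio);
                  return 0;   /* caller sees IOMAP_F_STALE, write ends short */
              }

              /* ... only now initialise folio state and contents ... */
              *foliop = folio;
              return 0;
          }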
      
      If the iomap fails validation, we then mark it stale, unlock the
      folio and end the write. This effectively means a stale iomap
      results in a short write. Filesystems should already be able to
      handle this, as write operations can end short for many reasons and
      need to iterate through another mapping cycle to be completed. Hence
      the iomap changes needed to detect and handle stale iomaps during
      write() operations are relatively simple...
      
      However, the assumption is that filesystems should already be able
      to handle write failures safely, and that's where the second
      (first?) part of the problem exists. That is, handling a partial
      write is harder than just "punching out the unused delayed
      allocation extent". This is because mmap() based faults can race
      with writes, and if they land in the delalloc region that the write
      allocated, then punching out the delalloc region can cause data
      corruption.
      
      This data corruption problem is exposed by generic/346 when iomap is
      converted to detect stale iomaps during write() operations. Hence
      write failure in the filesystem needs to handle the fact that the
      write() in progress doesn't necessarily own the data in the page
      cache over the range of the delalloc extent it just allocated.
      
      As a result, we can't just truncate the page cache over the range
      the write() didn't reach and punch all the delalloc extent. We have
      to walk the page cache over the untouched range and skip over any
      dirty data region in the cache in that range. Which is ....
      non-trivial.
      
      That is, iterating the page cache has to handle partially populated
      folios (i.e. block size < page size) that contain data. The data
      might be discontiguous within a folio. Indeed, there might be
      *multiple* discontiguous data regions within a single folio. And to
      make matters more complex, multi-page folios mean we just don't know
      how many sub-folio regions we might have to iterate to find all
      these regions. All the corner cases between the conversions and
      rounding between filesystem block size, folio size and multi-page
      folio size combined with unaligned write offsets kept breaking my
      brain.
      
      However, if we convert the code to track the processed
      write regions by byte ranges instead of filesystem block or page
      cache index, we could simply use mapping_seek_hole_data() to find
      the start and end of each discrete data region within the range we
      needed to scan. SEEK_DATA finds the start of the cached data region,
      SEEK_HOLE finds the end of the region. These are byte based
      interfaces that understand partially uptodate folio regions, and so
      can iterate discrete sub-folio data regions directly. This largely
      solved the problem of discovering the dirty regions we need to keep
      the delalloc extent over.
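
      A sketch of that scan loop, driven by mapping_seek_hole_data()
      (simplified; the decision about what to punch or keep is only
      indicated by comments):

          while (start_byte < end_byte) {
              loff_t data_end;

              /* find the start of the next cached data region */
              start_byte = mapping_seek_hole_data(inode->i_mapping,
                      start_byte, end_byte, SEEK_DATA);
              if (start_byte == -ENXIO || start_byte == end_byte)
                  break;   /* no more data cached over this range */

              /* find where that data region ends */
              data_end = mapping_seek_hole_data(inode->i_mapping,
                      start_byte, end_byte, SEEK_HOLE);

              /*
               * [start_byte, data_end) is one discrete cached data region,
               * possibly a sub-folio range: decide here whether the
               * delalloc extent under it must be kept (dirty) or can be
               * punched out (clean).
               */
              start_byte = data_end;
          }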
      
      However, to use mapping_seek_hole_data() without needing to export
      it, we have to move all the delalloc extent cleanup to the iomap
      core and so now the iomap core can clean up delayed allocation
      extents in a safe, sane and filesystem neutral manner.
      
      With all this done, the original data corruption never occurs
      anymore, and we now have a generic mechanism for ensuring that page
      cache writes do not do the wrong thing when writeback and reclaim
      change the state of the physical extent and/or page cache contents
      whilst the write is in progress.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      
      * tag 'xfs-iomap-stale-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs:
        xfs: drop write error injection is unfixable, remove it
        xfs: use iomap_valid method to detect stale cached iomaps
        iomap: write iomap validity checks
        xfs: xfs_bmap_punch_delalloc_range() should take a byte range
        iomap: buffered write failure should not truncate the page cache
        xfs,iomap: move delalloc punching to iomap
        xfs: use byte ranges for write cleanup ranges
        xfs: punching delalloc extents on write failure is racy
        xfs: write page faults in iomap are not buffered writes
    • xfs: drop write error injection is unfixable, remove it · 6e8af15c
      Dave Chinner committed
      With the changes to scan the page cache for dirty data to avoid data
      corruptions from partial write cleanup racing with other page cache
      operations, the drop writes error injection no longer works the same
      way it used to and causes xfs/196 to fail. This is because xfs/196
      writes to the file and populates the page cache before it turns on
      the error injection and starts failing -overwrites-.
      
      The result is that the original drop-writes code failed writes only
      -after- overwriting the data in the cache, followed by invalidating
      the cached data, then punching out the delalloc extent from under
      that data.
      
      On the surface, this looks fine. The problem is that page cache
      invalidation *doesn't guarantee that it removes anything from the
      page cache* and it doesn't change the dirty state of the folio. When
      block size == page size and we do page aligned IO (as xfs/196 does)
      everything happens to align perfectly and page cache invalidation
      removes the single page folios that span the written data. Hence the
      followup delalloc punch pass does not find cached data over that
      range and it can punch the extent out.
      
      IOWs, xfs/196 "works" for block size == page size with the new
      code. I say "works", because it actually only works for the case
      where IO is page aligned, and no data was read from disk before
      writes occur. Because the moment we actually read data first, the
      readahead code allocates multipage folios and suddenly the
      invalidate code goes back to zeroing subfolio ranges without
      changing dirty state.
      
      Hence, with multipage folios in play, block size == page size is
      functionally identical to block size < page size behaviour, and
      drop-writes is manifestly broken w.r.t. this case. Invalidation of
      a subfolio range doesn't result in the folio being removed from the
      cache, just the range gets zeroed. Hence after we've sequentially
      walked over a folio that we've dirtied (via write data) and then
      invalidated, we end up with a dirty folio full of zeroed data.
      
      And because the new code skips punching ranges that have dirty
      folios covering them, we end up leaving the delalloc range intact
      after failing all the writes. Hence failed writes now end up
      writing zeroes to disk in the cases where invalidation zeroes folios
      rather than removing them from cache.
      
      This is a fundamental change of behaviour that is needed to avoid
      the data corruption vectors that exist in the old write fail path,
      and it renders the drop-writes injection non-functional and
      unworkable as it stands.
      
      As it is, I think the error injection is also now unnecessary, as
      partial writes that need delalloc extent are going to be a lot more
      common with stale iomap detection in place. Hence this patch removes
      the drop-writes error injection completely. xfs/196 can remain for
      testing kernels that don't have this data corruption fix, but those
      that do will report:
      
      xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: use iomap_valid method to detect stale cached iomaps · 304a68b9
      Dave Chinner committed
      Now that iomap supports a mechanism to validate cached iomaps for
      buffered write operations, hook it up to the XFS buffered write ops
      so that we can avoid data corruptions that result from stale cached
      iomaps. See:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      or the ->iomap_valid() introduction commit for exact details of the
      corruption vector.
      
      The validity cookie we store in the iomap is based on the type of
      iomap we return. It is expected that the iomap->flags we set in
      xfs_bmbt_to_iomap() are not perturbed by the iomap core and are
      returned to us in the iomap passed via the .iomap_valid() callback.
      This ensures that the validity cookie is always checking the correct
      inode fork sequence numbers to detect potential changes that affect
      the extent cached by the iomap.
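
      Roughly, the resulting XFS-side check looks like the sketch below
      (reconstructed from this description; iomap->validity_cookie and
      xfs_iomap_inode_sequence() are named as in the 6.2-era series, but
      treat the details as approximate):

          static bool
          xfs_buffered_write_iomap_valid(
              struct inode        *inode,
              const struct iomap  *iomap)
          {
              /*
               * Recompute the cookie from the same iomap->flags that
               * xfs_bmbt_to_iomap() used when handing out the mapping, so
               * the comparison covers the correct fork sequence number
               * (data, cow or attr fork) for this type of iomap.
               */
              return iomap->validity_cookie ==
                      xfs_iomap_inode_sequence(XFS_I(inode), iomap->flags);
          }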
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • iomap: write iomap validity checks · d7b64041
      Dave Chinner committed
      A recent multithreaded write data corruption has been uncovered in
      the iomap write code. The core of the problem is that partial folio
      writes can be flushed to disk while a new racing write maps the same
      range and fills the rest of the page:
      
      writeback			new write
      
      allocate blocks
        blocks are unwritten
      submit IO
      .....
      				map blocks
      				iomap indicates UNWRITTEN range
      				loop {
      				  lock folio
      				  copyin data
      .....
      IO completes
        runs unwritten extent conv
          blocks are marked written
      				  <iomap now stale>
      				  get next folio
      				}
      
      Now add memory pressure such that memory reclaim evicts the
      partially written folio that has already been written to disk.
      
      When the new write finally gets to the last partial page of the
      write, it does not find it in the cache, so it instantiates a new page,
      sees the iomap is unwritten, and zeros the part of the page that
      it does not have data from. This overwrites the data on disk that
      was originally written.
      
      The full description of the corruption mechanism can be found here:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      To solve this problem, we need to check whether the iomap is still
      valid after we lock each folio during the write. We have to do it
      after we lock the page so that we don't end up with state changes
      occurring while we wait for the folio to be locked.
      
      Hence we need a mechanism to be able to check that the cached iomap
      is still valid (similar to what we already do in buffered
      writeback), and we need a way for ->begin_write to back out and
      tell the high level iomap iterator that we need to remap the
      remaining write range.
      
      The iomap needs to grow some storage for the validity cookie that
      the filesystem provides to travel with the iomap. XFS, in
      particular, also needs to know some more information about what the
      iomap maps (attribute extents rather than file data extents) to for
      the validity cookie to cover all the types of iomaps we might need
      to validate.
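
      In terms of the interface, the additions amount to roughly the
      following (an abbreviated sketch of the 6.2-era definitions, not a
      verbatim quote of the patch):

          struct iomap {
              /* ... existing fields ... */
              u64         validity_cookie;   /* opaque, filesystem-owned */
              const struct iomap_page_ops *page_ops;
          };

          struct iomap_page_ops {
              /* ... existing callbacks ... */
              /*
               * Check that the cached iomap still describes the range we
               * are about to write; returning false makes the core mark
               * the iomap stale and remap the remaining write range.
               */
              bool (*iomap_valid)(struct inode *inode,
                      const struct iomap *iomap);
          };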
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: xfs_bmap_punch_delalloc_range() should take a byte range · 7348b322
      Dave Chinner committed
      All the callers of xfs_bmap_punch_delalloc_range() jump through
      hoops to convert a byte range to filesystem blocks before calling
      xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
      xfs_bmap_punch_delalloc_range() and have it do the conversion to
      filesystem blocks internally.
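
      Approximately (a sketch; XFS_B_TO_FSBT()/XFS_B_TO_FSB() are the
      usual XFS byte-to-block conversion macros, and the punch body is
      elided):

          /*
           * Previously xfs_bmap_punch_delalloc_range(ip, start_fsb, length),
           * with every caller doing the byte-to-block conversion itself.
           */
          int
          xfs_bmap_punch_delalloc_range(
              struct xfs_inode    *ip,
              xfs_off_t           start_byte,
              xfs_off_t           end_byte)
          {
              struct xfs_mount    *mp = ip->i_mount;
              xfs_fileoff_t       start_fsb = XFS_B_TO_FSBT(mp, start_byte);
              xfs_fileoff_t       end_fsb = XFS_B_TO_FSB(mp, end_byte);

              /* ... punch delalloc extents over [start_fsb, end_fsb) ... */
              return 0;
          }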
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • iomap: buffered write failure should not truncate the page cache · f43dc4dc
      Dave Chinner committed
      iomap_file_buffered_write_punch_delalloc() currently invalidates the
      page cache over the unused range of the delalloc extent that was
      allocated. While the write allocated the delalloc extent, it does
      not own it exclusively as the write does not hold any locks that
      prevent either writeback or mmap page faults from changing the state
      of either the page cache or the extent state backing this range.
      
      Whilst xfs_bmap_punch_delalloc_range() already handles races in
      extent conversion - it will only punch out delalloc extents and it
      ignores any other type of extent - the page cache truncate does not
      discriminate between data written by this write or some other task.
      As a result, truncating the page cache can result in data corruption
      if the write races with mmap modifications to the file over the same
      range.
      
      generic/346 exercises this workload, and if we randomly fail writes
      (as will happen when iomap gets stale iomap detection later in the
      patchset), it will randomly corrupt the file data because it removes
      data written by mmap() in the same page as the write() that failed.
      
      Hence we do not want to punch out the page cache over the range of
      the extent we failed to write to - what we actually need to do is
      detect the ranges that have dirty data in cache over them and *not
      punch them out*.
      
      To do this, we have to walk the page cache over the range of the
      delalloc extent we want to remove. This is made complex by the fact
      we have to handle partially up-to-date folios correctly and this can
      happen even when the FSB size == PAGE_SIZE because we now support
      multi-page folios in the page cache.
      
      Because we are only interested in discovering the edges of data
      ranges in the page cache (i.e. hole-data boundaries) we can make use
      of mapping_seek_hole_data() to find those transitions in the page
      cache. As we hold the invalidate_lock, we know that the boundaries
      are not going to change while we walk the range. This interface is
      also byte-based and is sub-page block aware, so we can find the data
      ranges in the cache based on byte offsets rather than page, folio or
      fs block sized chunks. This greatly simplifies the logic of finding
      dirty cached ranges in the page cache.
      
      Once we've identified a range that contains cached data, we can then
      iterate the range folio by folio. This allows us to determine if the
      data is dirty and hence perform the correct delalloc extent punching
      operations. The seek interface we use to iterate data ranges will
      give us sub-folio start/end granularity, so we may end up looking up
      the same folio multiple times as the seek interface iterates across
      each discontiguous data region in the folio.
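
      A sketch of that per-folio pass over one cached data region found by
      the seek interface (simplified; the punch bookkeeping is only
      indicated by comments and the exact helpers used in the series may
      differ):

          while (start_byte < data_end) {
              struct folio *folio;

              /* the seek interface told us data is cached at this offset */
              folio = filemap_get_folio(inode->i_mapping,
                      start_byte >> PAGE_SHIFT);
              if (!folio) {
                  start_byte += PAGE_SIZE;
                  continue;
              }

              if (folio_test_dirty(folio) || folio_test_writeback(folio)) {
                  /* keep the delalloc extent backing this folio */
              } else {
                  /* clean and on disk: safe to punch the delalloc range */
              }

              /* a multi-page folio may span several seek-reported ranges */
              start_byte = (loff_t)folio_next_index(folio) << PAGE_SHIFT;
              folio_put(folio);
          }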
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  2. 23 November 2022 (3 commits)
    • xfs,iomap: move delalloc punching to iomap · 9c7babf9
      Dave Chinner committed
      Because that's what Christoph wants for this error handling path
      only XFS uses.
      
      It requires a new iomap export for handling errors over delalloc
      ranges. This is basically the XFS code as is stands, but even though
      Christoph wants this as iomap functionality, we still have
      to call it from the filesystem specific ->iomap_end callback, and
      call into the iomap code with yet another filesystem specific
      callback to punch the delalloc extent within the defined ranges.
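
      The resulting call chain is roughly the following (a sketch using
      the byte-range punch call the series later settles on; the XFS
      callback name is illustrative rather than quoted from the patch):

          /* filesystem-private punch callback handed to the iomap core */
          static int
          xfs_buffered_write_delalloc_punch(
              struct inode    *inode,
              loff_t          offset,
              loff_t          length)
          {
              return xfs_bmap_punch_delalloc_range(XFS_I(inode), offset,
                      offset + length);
          }

          /* from the XFS ->iomap_end method, on a short delalloc write */
          error = iomap_file_buffered_write_punch_delalloc(inode, iomap,
                  offset, length, written,
                  &xfs_buffered_write_delalloc_punch);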
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: use byte ranges for write cleanup ranges · b71f889c
      Dave Chinner committed
      xfs_buffered_write_iomap_end() currently converts the byte ranges
      passed to it to filesystem blocks to pass them to the bmap code to
      punch out delalloc blocks, but then has to convert filesystem
      blocks back to byte ranges for page cache truncate.
      
      We're about to make the page cache truncate go away and replace it
      with a page cache walk, so having to convert everything to/from/to
      filesystem blocks is messy and error-prone. It is much easier to
      pass around byte ranges and convert to page indexes and/or
      filesystem blocks only where those units are needed.
      
      In preparation for the page cache walk being added, add a helper
      that converts byte ranges to filesystem blocks and calls
      xfs_bmap_punch_delalloc_range() and convert
      xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: punching delalloc extents on write failure is racy · 198dd8ae
      Dave Chinner committed
      xfs_buffered_write_iomap_end() has a comment about the safety of
      punching delalloc extents based on holding the IOLOCK_EXCL. This
      comment is wrong, and punching delalloc extents is not race free.
      
      When we punch out a delalloc extent after a write failure in
      xfs_buffered_write_iomap_end(), we punch out the page cache with
      truncate_pagecache_range() before we punch out the delalloc extents.
      At this point, we only hold the IOLOCK_EXCL, so there is nothing
      stopping mmap() write faults racing with this cleanup operation,
      reinstantiating a folio over the range we are about to punch and
      hence requiring the delalloc extent to be kept.
      
      If this race condition is hit, we can end up with a dirty page in
      the page cache that has no delalloc extent or space reservation
      backing it. This leads to bad things happening at writeback time.
      
      To avoid this race condition, we need the page cache truncation to
      be atomic w.r.t. the extent manipulation. We can do this by holding
      the mapping->invalidate_lock exclusively across this operation -
      this will prevent new pages from being inserted into the page cache
      whilst we are removing the pages and the backing extent and space
      reservation.
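
      The error path then holds the invalidate_lock across both steps,
      along these lines (a simplified sketch; the punch call is shown in
      the byte-range form the series later settles on, and the
      punch_start_byte/punch_end_byte names are illustrative):

          /* block page faults from instantiating folios while we clean up */
          filemap_invalidate_lock(inode->i_mapping);

          truncate_pagecache_range(inode, punch_start_byte,
                  punch_end_byte - 1);
          error = xfs_bmap_punch_delalloc_range(ip, punch_start_byte,
                  punch_end_byte);

          filemap_invalidate_unlock(inode->i_mapping);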
      
      Taking the mapping->invalidate_lock exclusively in the buffered
      write IO path is safe - it naturally nests inside the IOLOCK (see
      truncate and fallocate paths). iomap_zero_range() can be called from
      under the mapping->invalidate_lock (from the truncate path via
      either xfs_zero_eof() or xfs_truncate_page()), but iomap_zero_iter()
      will not instantiate new delalloc pages (because it skips holes) and
      hence will not ever need to punch out delalloc extents on failure.
      
      Fix the locking issue, and clean up the code logic a little to avoid
      unnecessary work if we didn't allocate the delalloc extent or wrote
      the entire region we allocated.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  3. 22 November 2022 (1 commit)
    • xfs: fix incorrect i_nlink caused by inode racing · 28b4b059
      Long Li committed
      The following error occurred during the fsstress test:
      
      XFS: Assertion failed: VFS_I(ip)->i_nlink >= 2, file: fs/xfs/xfs_inode.c, line: 2452
      
      The problem was that an inode race condition caused an incorrect i_nlink
      to be written to disk and then read back into memory. Consider the
      following call graph: for inodes that are marked as both XFS_IFLUSHING
      and XFS_IRECLAIMABLE, i_nlink is reset to 1 and then restored to its
      original value in xfs_reinit_inode(). Therefore, the i_nlink of the
      directory on disk may be set to 1.
      
        xfsaild
            xfs_inode_item_push
                xfs_iflush_cluster
                    xfs_iflush
                        xfs_inode_to_disk
      
        xfs_iget
            xfs_iget_cache_hit
                xfs_iget_recycle
                    xfs_reinit_inode
                        inode_init_always
      
      xfs_reinit_inode() needs to hold the ILOCK_EXCL as it is changing internal
      inode state and can race with other RCU protected inode lookups. On the
      read side, xfs_iflush_cluster() grabs the ILOCK_SHARED while under rcu +
      ip->i_flags_lock, and so xfs_iflush/xfs_inode_to_disk() are protected from
      racing inode updates (during transactions) by that lock.
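
      The shape of the fix is therefore to reinitialise the inode with the
      ILOCK held exclusively, roughly as below (a sketch of the change in
      xfs_iget_recycle(); exact placement within the function is
      approximate):

          /*
           * Still under rcu + ip->i_flags_lock at this point, so the ILOCK
           * can only be trylocked; back off and retry the lookup if that
           * fails.
           */
          if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
              return -EAGAIN;

          /* ... drop i_flags_lock and the rcu read lock as before ... */

          error = xfs_reinit_inode(mp, inode);
          xfs_iunlock(ip, XFS_ILOCK_EXCL);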
      
      Fixes: ff7bebeb ("xfs: refactor the inode recycling code") # goes further back than this
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  4. 17 November 2022 (30 commits)