1. 29 Nov, 2022 · 2 commits
  2. 03 Aug, 2022 · 1 commit
  3. 17 May, 2022 · 1 commit
    • iomap: don't invalidate folios after writeback errors · e9c3a8e8
      Darrick J. Wong authored
      XFS has the unique behavior (as compared to the other Linux filesystems)
      that on writeback errors it will completely invalidate the affected
      folio and force the page cache to reread the contents from disk.  All
      other filesystems leave the page mapped and up to date.
      
      This is a rude awakening for user programs, since (in the case where
      write fails but reread doesn't) file contents will appear to revert to
      old disk contents with no notification other than an EIO on fsync.  This
      might have been annoying back in the days when iomap dealt with one page
      at a time, but with multipage folios, we can now throw away *megabytes*
      worth of data for a single write error.
      
      On *most* Linux filesystems, a program can respond to an EIO on write by
      redirtying the entire file and scheduling it for writeback.  This isn't
      foolproof, since the page that failed writeback is no longer dirty and
      could be evicted, but programs that want to recover properly *also*
      have to detect XFS and regenerate every write they've made to the file.
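      
      As a concrete illustration of that recovery strategy, here is a
      minimal userspace sketch (the file name, data buffer, and retry
      policy are hypothetical): keep an application-owned copy of the
      data, and when fsync() returns EIO, redirty the file by rewriting
      it wholesale from that copy instead of trusting the page cache.
      
          #include <errno.h>
          #include <fcntl.h>
          #include <string.h>
          #include <unistd.h>
          
          /* Write the whole buffer at offset 0, handling short writes. */
          static int write_all(int fd, const char *buf, size_t len)
          {
              size_t done = 0;
          
              while (done < len) {
                  ssize_t n = pwrite(fd, buf + done, len - done, done);
          
                  if (n < 0)
                      return -1;
                  done += (size_t)n;
              }
              return 0;
          }
          
          int main(void)
          {
              const char *data = "application-owned copy of the file\n";
              int fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
          
              if (fd < 0)
                  return 1;
          
              /* On EIO the failed pages are clean again and may be
               * evicted, so rewrite everything from our own copy. */
              for (int attempt = 0; attempt < 3; attempt++) {
                  if (write_all(fd, data, strlen(data)) == 0 &&
                      fsync(fd) == 0)
                      break;      /* data reached stable storage */
                  if (errno != EIO)
                      break;      /* not a writeback error */
              }
              close(fd);
              return 0;
          }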
      
      When running xfs/314 on arm64, I noticed a UAF when xfs_discard_folio
      invalidates multipage folios that could be undergoing writeback.  If,
      say, we have a 256K folio caching a mix of written and unwritten
      extents, it's possible that we could start writeback of the first (say)
      64K of the folio and then hit a writeback error on the next 64K.  We
      then free the iop attached to the folio, which is really bad because
      writeback completion on the first 64K will trip over the "blocks per
      folio > 1 && !iop" assertion.
      
      This can't be fixed by only invalidating the folio if writeback fails at
      the start of the folio, since the folio is marked !uptodate, which trips
      other assertions elsewhere.  Get rid of the whole behavior entirely.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      e9c3a8e8
  4. 10 May, 2022 · 2 commits
  5. 17 Mar, 2022 · 1 commit
  6. 15 Mar, 2022 · 3 commits
  7. 27 Jan, 2022 · 1 commit
    • xfs, iomap: limit individual ioend chain lengths in writeback · ebb7fb15
      Dave Chinner authored
      Trond Myklebust reported soft lockups in XFS IO completion such as
      this:
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
       CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
       Workqueue: xfs-conv/md127 xfs_end_io [xfs]
       RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
       Call Trace:
        wake_up_page_bit+0x8a/0x110
        iomap_finish_ioend+0xd7/0x1c0
        iomap_finish_ioends+0x7f/0xb0
        xfs_end_ioend+0x6b/0x100 [xfs]
        xfs_end_io+0xb9/0xe0 [xfs]
        process_one_work+0x1a7/0x360
        worker_thread+0x1fa/0x390
        kthread+0x116/0x130
        ret_from_fork+0x35/0x40
      
      Ioends are processed as an atomic completion unit when all the
      chained bios in the ioend have completed their IO. Logically
      contiguous ioends can also be merged and completed as a single,
      larger unit.  Both of these things can be problematic because
      neither the bio chains per ioend nor the size of the merged ioends
      processed as a single completion is bounded.
      
      If we have a large sequential dirty region in the page cache,
      write_cache_pages() will keep feeding us sequential pages and we
      will keep mapping them into ioends and bios until we get a dirty
      page at a non-sequential file offset. These large sequential runs
      will result in bio and ioend chaining to optimise the IO
      patterns. The pages under writeback are pinned within these chains
      until the submission chaining is broken, allowing the entire chain
      to be completed. This can result in huge chains being processed
      in IO completion context.
      
      We get deep bio chaining if we have large contiguous physical
      extents. We will keep adding pages to the current bio until it is
      full, then we'll chain a new bio to keep adding pages for writeback.
      Hence we can build bio chains that map millions of pages and tens of
      gigabytes of RAM if the page cache contains big enough contiguous
      dirty file regions. This long bio chain pins those pages until the
      final bio in the chain completes and the ioend can iterate all the
      chained bios and complete them.
      
      OTOH, if we have a physically fragmented file, we end up submitting
      one ioend per physical fragment, each with a small bio or bio
      chain attached. We do not chain these at IO submission time,
      but instead we chain them at completion time based on file
      offset via iomap_ioend_try_merge(). Hence we can end up with unbound
      ioend chains being built via completion merging.
      
      XFS can then do COW remapping or unwritten extent conversion on that
      merged chain, which involves walking an extent fragment at a time
      and running a transaction to modify the physical extent information.
      IOWs, we merge all the discontiguous ioends together into a
      contiguous file range, only to then process them individually as
      discontiguous extents.
      
      This extent manipulation is computationally expensive and can run in
      a tight loop, so merging logically contiguous but physically
      discontiguous ioends gains us nothing except hiding the fact that
      we broke the ioends up into individual physical extents at
      submission and then need to loop over those individual physical
      extents at completion.
      
      Hence we need to have mechanisms to limit ioend sizes and to break
      up completion processing of large merged ioend chains (a sketch of
      the resulting pattern follows this list):
      
      1. bio chains per ioend need to be bound in length. Pure overwrites
      go straight to iomap_finish_ioend() in softirq context with the
      exact bio chain attached to the ioend by submission. Hence the only
      way to prevent long holdoffs here is to bound ioend submission
      sizes because we can't reschedule in softirq context.
      
      2. iomap_finish_ioends() has to handle unbound merged ioend chains
      correctly. This relies on any one call to iomap_finish_ioend() being
      bound in runtime so that cond_resched() can be issued regularly as
      the long ioend chain is processed. i.e. this relies on mechanism #1
      to limit individual ioend sizes to work correctly.
      
      3. filesystems have to loop over the merged ioends to process
      physical extent manipulations. This means they can loop internally,
      and so we break merging at physical extent boundaries so the
      filesystem can easily insert reschedule points between individual
      extent manipulations.
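      
      The following self-contained userspace model illustrates the
      pattern behind mechanisms #1 and #2. The batch size, the type and
      helper names, and the use of sched_yield() as a stand-in for the
      kernel's cond_resched() are all illustrative assumptions, not the
      committed code.
      
          #include <sched.h>
          #include <stddef.h>
          #include <stdio.h>
          
          #define BATCH_FOLIOS 4096   /* assumed cap on folios per ioend (#1) */
          
          struct ioend {
              struct ioend *next;     /* completion-time merge chain */
              unsigned int nr_folios; /* bounded by BATCH_FOLIOS at submission */
          };
          
          /* Complete one bounded ioend; runtime is capped by BATCH_FOLIOS. */
          static unsigned int finish_one(const struct ioend *io)
          {
              return io->nr_folios;   /* stand-in for the per-folio work */
          }
          
          /* Walk an unbounded merged chain (#2), yielding the CPU between
           * bounded pieces of work -- the analogue of cond_resched(). */
          static void finish_chain(const struct ioend *chain)
          {
              unsigned int done = 0;
          
              for (const struct ioend *io = chain; io; io = io->next) {
                  done += finish_one(io);
                  if (done >= BATCH_FOLIOS) {
                      sched_yield();  /* kernel: cond_resched() */
                      done = 0;
                  }
              }
          }
          
          int main(void)
          {
              struct ioend b = { NULL, 3000 };
              struct ioend a = { &b, 4096 };
          
              finish_chain(&a);
              puts("chain completed without monopolizing the CPU");
              return 0;
          }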
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      ebb7fb15
  8. 18 Dec, 2021 · 1 commit
  9. 05 Dec, 2021 · 1 commit
  10. 23 Oct, 2021 · 1 commit
    • xfs: punch out data fork delalloc blocks on COW writeback failure · 5ca5916b
      Brian Foster authored
      If writeback I/O to a COW extent fails, the COW fork blocks are
      punched out and the data fork blocks left alone. It is possible for
      COW fork blocks to overlap non-shared data fork blocks (due to
      cowextsz hint prealloc), however, and writeback unconditionally maps
      to the COW fork whenever blocks exist at the corresponding offset of
      the page undergoing writeback. This means it's quite possible for a
      COW fork extent to overlap delalloc data fork blocks, writeback to
      convert and map to the COW fork blocks, writeback to fail, and
      finally for ioend completion to cancel the COW fork blocks and leave
      stale data fork delalloc blocks around in the inode. The blocks are
      effectively stale because writeback failure also discards dirty page
      state.
      
      If this occurs, it is likely to trigger assert failures, free space
      accounting corruption and failures in unrelated file operations. For
      example, a subsequent reflink attempt of the affected file to a new
      target file will trip over the stale delalloc in the source file and
      fail. Several of these issues are occasionally reproduced by
      generic/648, but are reproducible on demand with the right sequence
      of operations and timely I/O error injection.
      
      To fix this problem, update the ioend failure path to also punch out
      underlying data fork delalloc blocks on I/O error. This is analogous
      to the writeback submission failure path in xfs_discard_page() where
      we might fail to map data fork delalloc blocks and consistent with
      the successful COW writeback completion path, which is responsible
      for unmapping from the data fork and remapping in COW fork blocks.
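      
      The shape of the fix, going by the description above, is roughly
      the following fragment in the ioend failure path (a sketch, not
      the committed diff; start_fsb/end_fsb stand for the byte-to-
      filesystem-block conversions of the ioend's offset and size):
      
          /* Writeback failed: don't leave stale per-fork state behind. */
          if (unlikely(error)) {
              if (ioend->io_flags & IOMAP_F_SHARED) {
                  /* Cancel the COW fork blocks for the range... */
                  xfs_reflink_cancel_cow_range(ip, offset, size, true);
                  /* ...and also punch out any data fork delalloc
                   * blocks underlying the same range, so no stale
                   * delalloc survives the failure. */
                  xfs_bmap_punch_delalloc_range(ip, start_fsb, end_fsb);
              }
              goto done;
          }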
      
      Fixes: 787eb485 ("xfs: fix and streamline error handling in xfs_end_io")
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      5ca5916b
  11. 20 Aug, 2021 · 1 commit
  12. 19 Aug, 2021 · 1 commit
    • xfs: drop ->writepage completely · 21b4ee70
      Dave Chinner authored
      ->writepage is only used in one place - single page writeback from
      memory reclaim. We only allow such writeback from kswapd, not from
      direct memory reclaim, and so it is rarely used. When it comes from
      kswapd, it is effectively random dirty page shoot-down, which is
      horrible for IO patterns. We will already have background writeback
      trying to clean all the dirty pages in memory as efficiently as
      possible, so having kswapd interrupt our well formed IO stream only
      slows things down. So get rid of xfs_vm_writepage() completely.
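      
      The visible effect on the address space operations is roughly this
      (an abridged sketch; only the writeback-related entry is shown):
      
          const struct address_space_operations xfs_address_space_operations = {
              .writepages = xfs_vm_writepages,
              /* No .writepage entry any more: single-page writeback
               * from kswapd is simply not offered, so all writeback
               * flows through ->writepages. Other callbacks are
               * unchanged and omitted here. */
          };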
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [djwong: forward port to 5.15]
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      21b4ee70
  13. 30 Jun, 2021 · 2 commits
  14. 04 May, 2021 · 1 commit
  15. 16 Apr, 2021 · 1 commit
  16. 10 Apr, 2021 · 4 commits
  17. 08 Apr, 2021 · 1 commit
  18. 26 Mar, 2021 · 1 commit
  19. 26 Feb, 2021 · 1 commit
  20. 05 Nov, 2020 · 2 commits
    • xfs: fix missing CoW blocks writeback conversion retry · c2f09217
      Darrick J. Wong authored
      In commit 7588cbee, we tried to fix a race stemming from the lack of
      coordination between higher level code that wants to allocate and remap
      CoW fork extents into the data fork.  Christoph cites as examples the
      always_cow mode, and a directio write completion racing with writeback.
      
      According to the comments before the goto retry, we want to restart the
      lookup to catch the extent in the data fork, but we don't actually reset
      whichfork or cow_fsb, which means the second try executes using stale
      information.  Up until now I think we've gotten lucky that either
      there's something left in the CoW fork to cause cow_fsb to be reset, or
      the data/cow fork sequence numbers have advanced enough to force a
      fresh lookup from the data fork.  However, if we reach the retry with an
      empty stable CoW fork and a stable data fork, neither of those things
      happens.  The retry foolishly re-calls xfs_convert_blocks on the CoW
      fork which fails again.  This time, we toss the write.
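      
      The fix is essentially to reset that state before retrying,
      sketched here against the retry path described above (a sketch
      based on this description, not the committed diff):
      
          /*
           * The COW fork conversion found nothing: restart the lookup
           * from scratch against the data fork instead of reusing the
           * stale whichfork/cow_fsb from the failed attempt.
           */
          whichfork = XFS_DATA_FORK;
          cow_fsb = NULLFILEOFF;
          goto retry;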
      
      I've recently been working on extending reflink to the realtime device.
      When the realtime extent size is larger than a single block, we have to
      force the page cache to CoW the entire rt extent if a write (or
      fallocate) is not aligned with the rt extent size.  The strategy I've
      chosen to deal with this is derived from Dave's blocksize > pagesize
      series: dirtying around the write range, and ensuring that writeback
      always starts mapping on an rt extent boundary.  This has brought this
      race front and center, since generic/522 blows up immediately.
      
      However, I'm pretty sure this is a bug outright, independent of that.
      
      Fixes: 7588cbee ("xfs: retry COW fork delalloc conversion when no extent was found")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      c2f09217
    • iomap: support partial page discard on writeback block mapping failure · 763e4cdc
      Brian Foster authored
      iomap writeback mapping failure only calls into ->discard_page() if
      the current page has not been added to the ioend. Accordingly, the
      XFS callback assumes a full page discard and invalidation. This is
      problematic for sub-page block size filesystems where some portion
      of a page might have been mapped successfully before a failure to
      map a delalloc block occurs. ->discard_page() is not called in that
      error scenario and the bio is explicitly failed by iomap via the
      error return from ->prepare_ioend(). As a result, the filesystem
      leaks delalloc blocks and corrupts the filesystem block counters.
      
      Since XFS is the only user of ->discard_page(), tweak the semantics
      to invoke the callback unconditionally on mapping errors and provide
      the file offset that failed to map. Update xfs_discard_page() to
      discard the corresponding portion of the file and pass the range
      along to iomap_invalidatepage(). The latter already properly handles
      both full and sub-page scenarios by not changing any iomap or page
      state on sub-page invalidations.
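      
      In interface terms the change looks roughly like this (a sketch;
      the parameter name is an assumption):
      
          struct iomap_writeback_ops {
              /* ... other callbacks unchanged ... */
          
              /*
               * Old: void (*discard_page)(struct page *page);
               * called only when the whole page failed to map.
               * New: called unconditionally on mapping errors, with
               * the file offset at which mapping failed, so sub-page
               * ranges can be punched out and invalidated.
               */
              void (*discard_page)(struct page *page, loff_t fileoff);
          };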
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      763e4cdc
  21. 21 Sep, 2020 · 1 commit
  22. 03 Jun, 2020 · 1 commit
  23. 20 May, 2020 · 1 commit
  24. 03 Mar, 2020 · 1 commit
  25. 04 Jan, 2020 · 1 commit
  26. 28 Oct, 2019 · 1 commit
  27. 22 Oct, 2019 · 1 commit
  28. 21 Oct, 2019 · 4 commits