1. 16 Feb 2023, 1 commit
  2. 29 Nov 2022, 1 commit
    • iomap: write iomap validity checks · d7b64041
      Authored by Dave Chinner
      A recent multithreaded write data corruption has been uncovered in
      the iomap write code. The core of the problem is partial folio
      writes can be flushed to disk while a new racing write can map it
      and fill the rest of the page:
      
      writeback			new write
      
      allocate blocks
        blocks are unwritten
      submit IO
      .....
      				map blocks
      				iomap indicates UNWRITTEN range
      				loop {
      				  lock folio
      				  copyin data
      .....
      IO completes
        runs unwritten extent conv
          blocks are marked written
      				  <iomap now stale>
      				  get next folio
      				}
      
      Now add memory pressure such that memory reclaim evicts the
      partially written folio that has already been written to disk.
      
      When the new write finally gets to the last partial page of the new
      write, it does not find it in cache, so it instantiates a new page,
      sees the iomap is unwritten, and zeros the part of the page that
      it does not have data from. This overwrites the data on disk that
      was originally written.
      
      The full description of the corruption mechanism can be found here:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      To solve this problem, we need to check whether the iomap is still
      valid after we lock each folio during the write. We have to do it
      after we lock the page so that we don't end up with state changes
      occurring while we wait for the folio to be locked.
      
      Hence we need a mechanism to be able to check that the cached iomap
      is still valid (similar to what we already do in buffered
      writeback), and we need a way for ->begin_write to back out and
      tell the high level iomap iterator that we need to remap the
      remaining write range.
      
The iomap needs to grow some storage for the validity cookie that
the filesystem provides to travel with the iomap. XFS, in
particular, also needs to know more about what the iomap maps
(attribute extents rather than file data extents) so that the
validity cookie can cover all the types of iomaps we might need
to validate.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
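The validity-cookie mechanism described above can be sketched in plain userspace C. All of the names here (`fs_inode`, `cached_iomap`, the helpers) are illustrative, not the kernel's actual iomap API: the idea is that the filesystem samples a per-inode sequence number into the cached mapping when it is built, and the write path rechecks that cookie after locking each folio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only: these are hypothetical names, not the
 * kernel's iomap structures. */

struct fs_inode {
	uint64_t seq;	/* bumped on every extent-tree change, e.g.
			 * unwritten extent conversion at IO completion */
};

struct cached_iomap {
	uint64_t validity_cookie;	/* inode seq sampled at map time */
};

/* ->iomap_begin analogue: stamp the cookie when the mapping is built. */
static void map_blocks(const struct fs_inode *ip, struct cached_iomap *map)
{
	map->validity_cookie = ip->seq;
}

/* Recheck after locking each folio; a mismatch means the extent map
 * changed underneath us and the caller must back out and remap the
 * remaining write range. */
static bool iomap_still_valid(const struct fs_inode *ip,
			      const struct cached_iomap *map)
{
	return map->validity_cookie == ip->seq;
}
```

The check must run after the folio lock is taken, exactly as the commit message says, so no state change can slip in while waiting for the lock.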
  3. 23 Nov 2022, 1 commit
    • xfs,iomap: move delalloc punching to iomap · 9c7babf9
      Authored by Dave Chinner
      Because that's what Christoph wants for this error handling path
      only XFS uses.
      
      It requires a new iomap export for handling errors over delalloc
ranges. This is basically the XFS code as it stands, but even though
Christoph wants this as iomap functionality, we still have
to call it from the filesystem-specific ->iomap_end callback, and
call into the iomap code with yet another filesystem-specific
callback to punch the delalloc extent within the defined ranges.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
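The shape of such an export can be sketched in userspace C with hypothetical names (the real kernel interface differs): on an error over a delalloc range, the generic helper hands the portion of the mapped range that was never written back to a filesystem-supplied punch callback.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical signature for the per-filesystem punch callback. */
typedef void (*punch_fn)(void *inode, uint64_t offset, uint64_t length);

/* Record what was punched, so the behaviour can be observed. */
static uint64_t punched_off, punched_len;

static void record_punch(void *inode, uint64_t offset, uint64_t length)
{
	(void)inode;
	punched_off = offset;
	punched_len = length;
}

/* Generic helper, called from the filesystem's ->iomap_end on error:
 * punch out the delalloc reservation covering the part of the mapped
 * range beyond what was actually written. */
static void iomap_punch_failed_delalloc(void *inode,
					uint64_t written_end,
					uint64_t mapped_end,
					punch_fn punch)
{
	if (mapped_end > written_end)
		punch(inode, written_end, mapped_end - written_end);
}
```

This mirrors the structure the commit describes: generic iteration in iomap, with the actual extent punching delegated back to the filesystem through a callback.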
  4. 03 Aug 2022, 1 commit
  5. 23 Jul 2022, 1 commit
  6. 11 Jun 2022, 1 commit
  7. 16 May 2022, 2 commits
  8. 10 May 2022, 2 commits
  9. 15 Mar 2022, 2 commits
  10. 27 Jan 2022, 1 commit
    • xfs, iomap: limit individual ioend chain lengths in writeback · ebb7fb15
      Authored by Dave Chinner
      Trond Myklebust reported soft lockups in XFS IO completion such as
      this:
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
       CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
       Workqueue: xfs-conv/md127 xfs_end_io [xfs]
       RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
       Call Trace:
        wake_up_page_bit+0x8a/0x110
        iomap_finish_ioend+0xd7/0x1c0
        iomap_finish_ioends+0x7f/0xb0
        xfs_end_ioend+0x6b/0x100 [xfs]
        xfs_end_io+0xb9/0xe0 [xfs]
        process_one_work+0x1a7/0x360
        worker_thread+0x1fa/0x390
        kthread+0x116/0x130
        ret_from_fork+0x35/0x40
      
      Ioends are processed as an atomic completion unit when all the
      chained bios in the ioend have completed their IO. Logically
      contiguous ioends can also be merged and completed as a single,
larger unit. Both of these things can be problematic because the
bio chains per ioend and the size of the merged ioends processed as
a single completion are unbound.
      
      If we have a large sequential dirty region in the page cache,
      write_cache_pages() will keep feeding us sequential pages and we
      will keep mapping them into ioends and bios until we get a dirty
page at a non-sequential file offset. These large sequential runs
will result in bio and ioend chaining to optimise the io
patterns. The pages under writeback are pinned within these chains
      until the submission chaining is broken, allowing the entire chain
      to be completed. This can result in huge chains being processed
      in IO completion context.
      
      We get deep bio chaining if we have large contiguous physical
      extents. We will keep adding pages to the current bio until it is
      full, then we'll chain a new bio to keep adding pages for writeback.
      Hence we can build bio chains that map millions of pages and tens of
      gigabytes of RAM if the page cache contains big enough contiguous
      dirty file regions. This long bio chain pins those pages until the
      final bio in the chain completes and the ioend can iterate all the
      chained bios and complete them.
      
      OTOH, if we have a physically fragmented file, we end up submitting
      one ioend per physical fragment that each have a small bio or bio
      chain attached to them. We do not chain these at IO submission time,
      but instead we chain them at completion time based on file
      offset via iomap_ioend_try_merge(). Hence we can end up with unbound
      ioend chains being built via completion merging.
      
      XFS can then do COW remapping or unwritten extent conversion on that
      merged chain, which involves walking an extent fragment at a time
      and running a transaction to modify the physical extent information.
      IOWs, we merge all the discontiguous ioends together into a
      contiguous file range, only to then process them individually as
      discontiguous extents.
      
This extent manipulation is computationally expensive and can run in
a tight loop, so merging logically contiguous but physically
discontiguous ioends gains us nothing except for hiding the fact
that we broke the ioends up into individual physical extents at
submission and then need to loop over those individual physical
extents at completion.
      
      Hence we need to have mechanisms to limit ioend sizes and
      to break up completion processing of large merged ioend chains:
      
      1. bio chains per ioend need to be bound in length. Pure overwrites
      go straight to iomap_finish_ioend() in softirq context with the
      exact bio chain attached to the ioend by submission. Hence the only
      way to prevent long holdoffs here is to bound ioend submission
      sizes because we can't reschedule in softirq context.
      
      2. iomap_finish_ioends() has to handle unbound merged ioend chains
      correctly. This relies on any one call to iomap_finish_ioend() being
      bound in runtime so that cond_resched() can be issued regularly as
      the long ioend chain is processed. i.e. this relies on mechanism #1
      to limit individual ioend sizes to work correctly.
      
      3. filesystems have to loop over the merged ioends to process
      physical extent manipulations. This means they can loop internally,
      and so we break merging at physical extent boundaries so the
      filesystem can easily insert reschedule points between individual
      extent manipulations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
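Mechanism #1 above can be modelled in userspace C. The cap value and the names are illustrative, not what the kernel chose: at submission, adding more pages to an ioend past the size limit forces a new, bounded ioend to be started.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative cap on how much data one ioend may cover. */
#define IOEND_MAX_BYTES ((uint64_t)4096 * 1024)

/* Mechanism #1: decide at submission whether adding `add` bytes to
 * the current ioend would exceed the cap, forcing a new ioend. */
static int ioend_needs_split(uint64_t ioend_bytes, uint64_t add)
{
	return ioend_bytes + add > IOEND_MAX_BYTES;
}

/* Count how many bounded ioends a large sequential dirty run becomes
 * when each page is added in file order, as write_cache_pages() would. */
static unsigned int submit_run(uint64_t run_bytes, uint64_t page_bytes)
{
	unsigned int ioends = 1;
	uint64_t cur = 0;
	uint64_t off;

	for (off = 0; off < run_bytes; off += page_bytes) {
		if (ioend_needs_split(cur, page_bytes)) {
			ioends++;	/* start a new bounded ioend */
			cur = 0;
		}
		cur += page_bytes;
	}
	return ioends;
}
```

With individual ioends bounded like this, mechanism #2 only needs to issue cond_resched() between ioends as iomap_finish_ioends() walks a merged chain, since each single iomap_finish_ioend() call is now bounded in runtime.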
  11. 18 Dec 2021, 1 commit
  12. 17 Dec 2021, 1 commit
  13. 05 Dec 2021, 1 commit
  14. 24 Oct 2021, 2 commits
  15. 18 Oct 2021, 2 commits
  16. 17 Aug 2021, 9 commits
  17. 04 Aug 2021, 1 commit
  18. 30 Jun 2021, 1 commit
  19. 04 May 2021, 1 commit
  20. 09 Feb 2021, 1 commit
  21. 24 Jan 2021, 2 commits
  22. 05 Nov 2020, 1 commit
    • iomap: support partial page discard on writeback block mapping failure · 763e4cdc
      Authored by Brian Foster
      iomap writeback mapping failure only calls into ->discard_page() if
      the current page has not been added to the ioend. Accordingly, the
      XFS callback assumes a full page discard and invalidation. This is
      problematic for sub-page block size filesystems where some portion
      of a page might have been mapped successfully before a failure to
      map a delalloc block occurs. ->discard_page() is not called in that
      error scenario and the bio is explicitly failed by iomap via the
      error return from ->prepare_ioend(). As a result, the filesystem
      leaks delalloc blocks and corrupts the filesystem block counters.
      
      Since XFS is the only user of ->discard_page(), tweak the semantics
      to invoke the callback unconditionally on mapping errors and provide
      the file offset that failed to map. Update xfs_discard_page() to
      discard the corresponding portion of the file and pass the range
      along to iomap_invalidatepage(). The latter already properly handles
      both full and sub-page scenarios by not changing any iomap or page
      state on sub-page invalidations.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
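The sub-page arithmetic behind this can be sketched in userspace C; the page size and names are illustrative. Given the file offset at which delalloc mapping failed, only the remainder of that page is discarded, leaving the successfully mapped head of the page alone:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u	/* illustrative page size */

/* Compute the range handed to invalidation: from the failing offset
 * to the end of its page, not the whole page. */
static void discard_page_range(uint64_t failed_pos,
			       uint64_t *off, uint64_t *len)
{
	/* round failed_pos up to the next page boundary */
	uint64_t page_end = (failed_pos | (PAGE_SIZE_BYTES - 1)) + 1;

	*off = failed_pos;
	*len = page_end - failed_pos;
}
```

When the mapping fails at the very start of a page this degenerates to a full-page discard, matching the note above that the invalidation path already handles both the full-page and sub-page cases.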
  23. 28 Sep 2020, 1 commit
  24. 04 Jun 2020, 1 commit
  25. 03 Jun 2020, 1 commit
  26. 25 May 2020, 1 commit