1. 29 Mar 2023, 1 commit
    • xfs, iomap: limit individual ioend chain lengths in writeback · c5883137
      Authored by Dave Chinner
      mainline inclusion
      from mainline-v5.17-rc3
      commit ebb7fb15
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ebb7fb1557b1d03b906b668aa2164b51e6b7d19a
      
      --------------------------------
      
      Trond Myklebust reported soft lockups in XFS IO completion such as
      this:
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
       CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
       Workqueue: xfs-conv/md127 xfs_end_io [xfs]
       RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
       Call Trace:
        wake_up_page_bit+0x8a/0x110
        iomap_finish_ioend+0xd7/0x1c0
        iomap_finish_ioends+0x7f/0xb0
        xfs_end_ioend+0x6b/0x100 [xfs]
        xfs_end_io+0xb9/0xe0 [xfs]
        process_one_work+0x1a7/0x360
        worker_thread+0x1fa/0x390
        kthread+0x116/0x130
        ret_from_fork+0x35/0x40
      
      Ioends are processed as an atomic completion unit when all the
      chained bios in the ioend have completed their IO. Logically
      contiguous ioends can also be merged and completed as a single,
      larger unit.  Both of these things can be problematic because the
      bio chains per ioend and the size of the merged ioends processed as
      a single completion are unbound.
      
      If we have a large sequential dirty region in the page cache,
      write_cache_pages() will keep feeding us sequential pages and we
      will keep mapping them into ioends and bios until we get a dirty
      page at a non-sequential file offset. These large sequential runs
      will result in bio and ioend chaining to optimise the io
      patterns. The pages under writeback are pinned within these chains
      until the submission chaining is broken, allowing the entire chain
      to be completed. This can result in huge chains being processed
      in IO completion context.
      
      We get deep bio chaining if we have large contiguous physical
      extents. We will keep adding pages to the current bio until it is
      full, then we'll chain a new bio to keep adding pages for writeback.
      Hence we can build bio chains that map millions of pages and tens of
      gigabytes of RAM if the page cache contains big enough contiguous
      dirty file regions. This long bio chain pins those pages until the
      final bio in the chain completes and the ioend can iterate all the
      chained bios and complete them.
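
      For reference, the submission-side chaining is a short helper in
      fs/iomap/buffered-io.c; a sketch of iomap_chain_bio() as it exists
      around this kernel version (note the bio_get() that keeps each
      submitted bio alive until ioend completion iterates the chain):

      static struct bio *
      iomap_chain_bio(struct bio *prev)
      {
              struct bio *new;

              new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
              bio_copy_dev(new, prev);   /* also copies the blkcg info */
              new->bi_iter.bi_sector = bio_end_sector(prev);
              new->bi_opf = prev->bi_opf;
              new->bi_write_hint = prev->bi_write_hint;

              bio_chain(prev, new);
              bio_get(prev);             /* for iomap_finish_ioend */
              submit_bio(prev);
              return new;
      }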
      
      OTOH, if we have a physically fragmented file, we end up submitting
      one ioend per physical fragment that each have a small bio or bio
      chain attached to them. We do not chain these at IO submission time,
      but instead we chain them at completion time based on file
      offset via iomap_ioend_try_merge(). Hence we can end up with unbound
      ioend chains being built via completion merging.
      
      XFS can then do COW remapping or unwritten extent conversion on that
      merged chain, which involves walking an extent fragment at a time
      and running a transaction to modify the physical extent information.
      IOWs, we merge all the discontiguous ioends together into a
      contiguous file range, only to then process them individually as
      discontiguous extents.
      
      This extent manipulation is computationally expensive and can run in
      a tight loop, so merging logically contiguous but physically
      discontiguous ioends gains us nothing except for hiding the fact
      that we broke the ioends up into individual physical extents at
      submission and then need to loop over those individual physical
      extents at completion.
      
      Hence we need to have mechanisms to limit ioend sizes and
      to break up completion processing of large merged ioend chains:
      
      1. bio chains per ioend need to be bound in length. Pure overwrites
      go straight to iomap_finish_ioend() in softirq context with the
      exact bio chain attached to the ioend by submission. Hence the only
      way to prevent long holdoffs here is to bound ioend submission
      sizes because we can't reschedule in softirq context.
      
      2. iomap_finish_ioends() has to handle unbound merged ioend chains
      correctly. This relies on any one call to iomap_finish_ioend() being
      bound in runtime so that cond_resched() can be issued regularly as
      the long ioend chain is processed. i.e. this relies on mechanism #1
      to limit individual ioend sizes to work correctly.
      
      3. filesystems have to loop over the merged ioends to process
      physical extent manipulations. This means they can loop internally,
      and so we break merging at physical extent boundaries so the
      filesystem can easily insert reschedule points between individual
      extent manipulations.
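
      As a sketch, mechanisms #1 and #2 end up looking roughly like this
      (IOEND_BATCH_SIZE and the cond_resched() batching follow the
      upstream commit; the io_pages counter stands in for upstream's
      io_folios, since the folio conversions noted under Conflicts below
      are not applied in this backport):

      /*
       * Mechanism #1: bound ioend size at submission, so that pure
       * overwrite completions running in softirq context are bounded.
       */
      #define IOEND_BATCH_SIZE        4096

              /* in iomap_can_add_to_ioend(): refuse to grow a full ioend */
              if (wpc->ioend->io_pages >= IOEND_BATCH_SIZE)
                      return false;

      /*
       * Mechanism #2: with individual ioends bounded, walking a long
       * merged completion chain can reschedule between ioends.
       */
      void
      iomap_finish_ioends(struct iomap_ioend *ioend, int error)
      {
              struct list_head tmp;
              u32 completions;

              might_sleep();

              list_replace_init(&ioend->io_list, &tmp);
              completions = iomap_finish_ioend(ioend, error);

              while (!list_empty(&tmp)) {
                      ioend = list_first_entry(&tmp, struct iomap_ioend,
                                      io_list);
                      list_del_init(&ioend->io_list);
                      completions += iomap_finish_ioend(ioend, error);
                      if (completions > IOEND_BATCH_SIZE * 8) {
                              cond_resched();
                              completions = 0;
                      }
              }
      }
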
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Conflicts:
      	include/linux/iomap.h
      	fs/iomap/buffered-io.c
      	fs/xfs/xfs_aops.c
      
      	[ 6e552494 ("iomap: remove unused private field from ioend")
      	  is not applied.
      	  95c4cd05 ("iomap: Convert to_iomap_page to take a folio") is
      	  not applied.
      	  8ffd74e9 ("iomap: Convert bio completions to use folios") is
      	  not applied.
      	  044c6449 ("xfs: drop unused ioend private merge and
      	  setfilesize code") is not applied. ]
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  2. 18 Jan 2023, 1 commit
  3. 27 Dec 2021, 2 commits
    • xfs: punch out data fork delalloc blocks on COW writeback failure · 22987db8
      Authored by Brian Foster
      mainline-inclusion
      from mainline-v5.15-rc4
      commit 5ca5916b
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5ca5916b6bc93577c360c06cb7cdf71adb9b5faf
      
      -------------------------------------------------
      
      If writeback I/O to a COW extent fails, the COW fork blocks are
      punched out and the data fork blocks left alone. It is possible for
      COW fork blocks to overlap non-shared data fork blocks (due to
      cowextsz hint prealloc), however, and writeback unconditionally maps
      to the COW fork whenever blocks exist at the corresponding offset of
      the page undergoing writeback. This means it's quite possible for a
      COW fork extent to overlap delalloc data fork blocks, writeback to
      convert and map to the COW fork blocks, writeback to fail, and
      finally for ioend completion to cancel the COW fork blocks and leave
      stale data fork delalloc blocks around in the inode. The blocks are
      effectively stale because writeback failure also discards dirty page
      state.
      
      If this occurs, it is likely to trigger assert failures, free space
      accounting corruption and failures in unrelated file operations. For
      example, a subsequent reflink attempt of the affected file to a new
      target file will trip over the stale delalloc in the source file and
      fail. Several of these issues are occasionally reproduced by
      generic/648, but are reproducible on demand with the right sequence
      of operations and timely I/O error injection.
      
      To fix this problem, update the ioend failure path to also punch out
      underlying data fork delalloc blocks on I/O error. This is analogous
      to the writeback submission failure path in xfs_discard_page() where
      we might fail to map data fork delalloc blocks and consistent with
      the successful COW writeback completion path, which is responsible
      for unmapping from the data fork and remapping in COW fork blocks.
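
      As a sketch, the failure path in xfs_end_ioend() becomes (modeled
      on the upstream commit; the fsb conversion is written out explicitly
      here because the punch helper's argument convention varies across
      kernel versions):

              error = blk_status_to_errno(ioend->io_bio->bi_status);
              if (unlikely(error)) {
                      if (ioend->io_flags & IOMAP_F_SHARED) {
                              /* cancel the COW fork blocks ... */
                              xfs_reflink_cancel_cow_range(ip, offset,
                                              size, true);
                              /*
                               * ... and punch out any data fork delalloc
                               * blocks underlying the same byte range, as
                               * the dirty page state covering them is
                               * already gone.
                               */
                              xfs_bmap_punch_delalloc_range(ip,
                                              XFS_B_TO_FSBT(mp, offset),
                                              XFS_B_TO_FSB(mp, offset + size) -
                                              XFS_B_TO_FSBT(mp, offset));
                      }
                      goto done;
              }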
      
      Fixes: 787eb485 ("xfs: fix and streamline error handling in xfs_end_io")
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Lihong Kou <koulihong@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • xfs: drop submit side trans alloc for append ioends · b906e741
      Authored by Brian Foster
      mainline-inclusion
      from mainline-v5.12-rc4
      commit 7cd3099f
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cd3099f4925d7c15887d1940ebd65acd66100f5
      
      -------------------------------------------------
      
      Per-inode ioend completion batching has a log reservation deadlock
      vector between preallocated append transactions and transactions
      that are acquired at completion time for other purposes (i.e.,
      unwritten extent conversion or COW fork remaps). For example, if the
      ioend completion workqueue task executes on a batch of ioends that
      are sorted such that an append ioend sits at the tail, it's possible
      for the outstanding append transaction reservation to block
      allocation of transactions required to process preceding ioends in
      the list.
      
      Append ioend completion is historically the common path for on-disk
      inode size updates. While file extending writes may have completed
      sometime earlier, the on-disk inode size is only updated after
      successful writeback completion. These transactions are preallocated
      serially from writeback context to mitigate concurrency and
      associated log reservation pressure across completions processed by
      multi-threaded workqueue tasks.
      
      However, now that delalloc blocks unconditionally map to unwritten
      extents at physical block allocation time, size updates via append
      ioends are relatively rare. This means that inode size updates most
      commonly occur as part of the preexisting completion time
      transaction to convert unwritten extents. As a result, there is no
      longer a strong need to preallocate size update transactions.
      
      Remove the preallocation of inode size update transactions to avoid
      the ioend completion processing log reservation deadlock. Instead,
      continue to send all potential size extending ioends to workqueue
      context for completion and allocate the transaction from that
      context. This ensures that no outstanding log reservation is owned
      by the ioend completion worker task when it begins to process
      ioends.
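
      A sketch of the resulting completion path (the helper names are
      those used by xfs_aops.c in this era; the point is that
      xfs_setfilesize() now allocates its transaction here, in workqueue
      context):

              /* in xfs_end_ioend(), running from the completion worker */
              if (ioend->io_flags & IOMAP_F_SHARED)
                      error = xfs_reflink_end_cow(ip, offset, size);
              else if (ioend->io_type == IOMAP_UNWRITTEN)
                      error = xfs_iomap_write_unwritten(ip, offset, size,
                                      false);

              /*
               * No log reservation is held when the worker starts: the
               * on-disk size update allocates its own transaction here
               * instead of carrying one preallocated at submission time.
               */
              if (!error && xfs_ioend_is_append(ioend))
                      error = xfs_setfilesize(ip, offset, size);
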
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Lihong Kou <koulihong@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  4. 05 Nov 2020, 2 commits
    • xfs: fix missing CoW blocks writeback conversion retry · c2f09217
      Authored by Darrick J. Wong
      In commit 7588cbee, we tried to fix a race stemming from the lack of
      coordination between higher level code that wants to allocate and remap
      CoW fork extents into the data fork.  Christoph cites as examples the
      always_cow mode, and a directio write completion racing with writeback.
      
      According to the comments before the goto retry, we want to restart the
      lookup to catch the extent in the data fork, but we don't actually reset
      whichfork or cow_fsb, which means the second try executes using stale
      information.  Up until now I think we've gotten lucky that either
      there's something left in the CoW fork to cause cow_fsb to be reset, or
      either data/cow fork sequence numbers have advanced enough to force a
      fresh lookup from the data fork.  However, if we reach the retry with an
      empty stable CoW fork and a stable data fork, neither of those things
      happens.  The retry foolishly re-calls xfs_convert_blocks on the CoW
      fork which fails again.  This time, we toss the write.
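
      The fix itself is a small reordering in xfs_map_blocks(); as a
      sketch, the per-attempt state moves below the retry label so a
      second pass really does start over from the data fork:

              int whichfork;
              xfs_fileoff_t cow_fsb;
              ...
      retry:
              /* reset lookup state on every pass, not just the first */
              cow_fsb = NULLFILEOFF;
              whichfork = XFS_DATA_FORK;
              xfs_ilock(ip, XFS_ILOCK_SHARED);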
      
      I've recently been working on extending reflink to the realtime device.
      When the realtime extent size is larger than a single block, we have to
      force the page cache to CoW the entire rt extent if a write (or
      fallocate) is not aligned with the rt extent size.  The strategy I've
      chosen to deal with this is derived from Dave's blocksize > pagesize
      series: dirtying around the write range, and ensuring that writeback
      always starts mapping on an rt extent boundary.  This has brought this
      race front and center, since generic/522 blows up immediately.
      
      However, I'm pretty sure this is a bug outright, independent of that.
      
      Fixes: 7588cbee ("xfs: retry COW fork delalloc conversion when no extent was found")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • iomap: support partial page discard on writeback block mapping failure · 763e4cdc
      Authored by Brian Foster
      iomap writeback mapping failure only calls into ->discard_page() if
      the current page has not been added to the ioend. Accordingly, the
      XFS callback assumes a full page discard and invalidation. This is
      problematic for sub-page block size filesystems where some portion
      of a page might have been mapped successfully before a failure to
      map a delalloc block occurs. ->discard_page() is not called in that
      error scenario and the bio is explicitly failed by iomap via the
      error return from ->prepare_ioend(). As a result, the filesystem
      leaks delalloc blocks and corrupts the filesystem block counters.
      
      Since XFS is the only user of ->discard_page(), tweak the semantics
      to invoke the callback unconditionally on mapping errors and provide
      the file offset that failed to map. Update xfs_discard_page() to
      discard the corresponding portion of the file and pass the range
      along to iomap_invalidatepage(). The latter already properly handles
      both full and sub-page scenarios by not changing any iomap or page
      state on sub-page invalidations.
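
      A sketch of the reworked XFS callback (modeled on the upstream
      commit): it now always runs on mapping failure and receives the
      file offset that failed to map, so it can punch and invalidate only
      the failed portion of the page:

      static void
      xfs_discard_page(struct page *page, loff_t fileoff)
      {
              struct inode *inode = page->mapping->host;
              struct xfs_inode *ip = XFS_I(inode);
              struct xfs_mount *mp = ip->i_mount;
              unsigned int pageoff = offset_in_page(fileoff);
              xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, fileoff);
              xfs_fileoff_t pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
              int error;

              if (XFS_FORCED_SHUTDOWN(mp))
                      goto out_invalidate;

              xfs_alert_ratelimited(mp,
                      "page discard on page "PTR_FMT", inode 0x%llx, pos %llu.",
                              page, ip->i_ino, fileoff);

              /* punch only the delalloc blocks backing the failed part */
              error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
                              i_blocks_per_page(inode, page) - pageoff_fsb);
              if (error && !XFS_FORCED_SHUTDOWN(mp))
                      xfs_alert(mp,
              "page discard unable to remove delalloc mapping.");
      out_invalidate:
              /* sub-page invalidation leaves iomap/page state untouched */
              iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
      }
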
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  5. 21 Sep 2020, 1 commit
  6. 03 Jun 2020, 1 commit
  7. 20 May 2020, 1 commit
  8. 03 Mar 2020, 1 commit
  9. 04 Jan 2020, 1 commit
  10. 28 Oct 2019, 1 commit
  11. 22 Oct 2019, 1 commit
  12. 21 Oct 2019, 6 commits
  13. 01 Jul 2019, 5 commits
  14. 29 Jun 2019, 3 commits
  15. 17 Jun 2019, 1 commit
  16. 30 Apr 2019, 1 commit
  17. 17 Apr 2019, 2 commits
    • xfs: merge adjacent io completions of the same type · 3994fc48
      Authored by Darrick J. Wong
      It's possible for pagecache writeback to split up a large amount of work
      into smaller pieces for throttling purposes or to reduce the amount of
      time a writeback operation is pending.  Whatever the reason, XFS can end
      up with a bunch of IO completions that call for the same operation to be
      performed on a contiguous extent mapping.  Since mappings are extent
      based in XFS, we'd prefer to run fewer transactions when we can.
      
      When we're processing an ioend on the list of io completions, check to
      see if the next items on the list are both adjacent and of the same
      type.  If so, we can merge the completions to reduce transaction
      overhead.
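
      A simplified sketch of the merge loop ("same type" means the fork
      the ioend targets and its extent state both match; setfilesize
      transaction handling is elided):

      static void
      xfs_ioend_try_merge(struct xfs_ioend *ioend,
                      struct list_head *more_ioends)
      {
              struct xfs_ioend *next;

              while (!list_empty(more_ioends)) {
                      next = list_first_entry(more_ioends,
                                      struct xfs_ioend, io_list);
                      /* must be adjacent in the file ... */
                      if (ioend->io_offset + ioend->io_size !=
                          next->io_offset)
                              break;
                      /* ... and need the same completion processing */
                      if (ioend->io_fork != next->io_fork ||
                          ioend->io_state != next->io_state)
                              break;
                      list_move_tail(&next->io_list, &ioend->io_list);
                      ioend->io_size += next->io_size;
              }
      }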
      
      On fast storage this doesn't seem to make much of a difference in
      performance, though the number of transactions for an overnight xfstests
      run seems to drop by ~5%.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
    • xfs: implement per-inode writeback completion queues · cb357bf3
      Authored by Darrick J. Wong
      When scheduling writeback of dirty file data in the page cache, XFS uses
      IO completion workqueue items to ensure that filesystem metadata only
      updates after the write completes successfully.  This is essential for
      converting unwritten extents to real extents at the right time and
      performing COW remappings.
      
      Unfortunately, XFS queues each IO completion work item to an unbounded
      workqueue, which means that the kernel can spawn dozens of threads to
      try to handle the items quickly.  These threads need to take the ILOCK
      to update file metadata, which results in heavy ILOCK contention if a
      large number of the work items target a single file, which is
      inefficient.
      
      Worse yet, the writeback completion threads get stuck waiting for the
      ILOCK while holding transaction reservations, which can use up all
      available log reservation space.  When that happens, metadata updates to
      other parts of the filesystem grind to a halt, even if the filesystem
      could otherwise have handled it.
      
      Even worse, if one of the things grinding to a halt happens to be a
      thread in the middle of a defer-ops finish holding the same ILOCK and
      trying to obtain more log reservation having exhausted the permanent
      reservation, we now have an ABBA deadlock - writeback completion has a
      transaction reserved and wants the ILOCK, and someone else has the ILOCK
      and wants a transaction reservation.
      
      Therefore, we create a per-inode writeback io completion queue + work
      item.  When writeback finishes, it can add the ioend to the per-inode
      queue and let the single worker item process that queue.  This
      dramatically cuts down on the number of kworkers and ILOCK contention in
      the system, and seems to have eliminated an occasional deadlock I was
      seeing while running generic/476.
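
      A sketch of the plumbing (field and helper names per the upstream
      commit): bio completion queues the ioend on the inode, scheduling
      the single work item only when the list was empty, and the worker
      drains the whole list:

              /* in xfs_end_bio(), called from bio completion context */
              spin_lock_irqsave(&ip->i_ioend_lock, flags);
              if (list_empty(&ip->i_ioend_list))
                      WARN_ON_ONCE(!queue_work(mp->m_unwritten_workqueue,
                                               &ip->i_ioend_work));
              list_add_tail(&ioend->io_list, &ip->i_ioend_list);
              spin_unlock_irqrestore(&ip->i_ioend_lock, flags);

      /* finish all pending io completions for this inode */
      void
      xfs_end_io(struct work_struct *work)
      {
              struct xfs_inode *ip =
                      container_of(work, struct xfs_inode, i_ioend_work);
              struct xfs_ioend *ioend;
              struct list_head completion_list;
              unsigned long flags;

              spin_lock_irqsave(&ip->i_ioend_lock, flags);
              list_replace_init(&ip->i_ioend_list, &completion_list);
              spin_unlock_irqrestore(&ip->i_ioend_lock, flags);

              while (!list_empty(&completion_list)) {
                      ioend = list_first_entry(&completion_list,
                                      struct xfs_ioend, io_list);
                      list_del_init(&ioend->io_list);
                      xfs_end_ioend(ioend);
              }
      }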
      
      Testing with a program that simulates a heavy random-write workload to a
      single file demonstrates that the number of kworkers drops from
      approximately 120 threads per file to 1, without dramatically changing
      write bandwidth or pagecache access latency.
      
      Note that we leave the xfs-conv workqueue's max_active alone because we
      still want to be able to run ioend processing for as many inodes as the
      system can handle.
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
  18. 21 Feb 2019, 2 commits
    • xfs: introduce an always_cow mode · 66ae56a5
      Authored by Christoph Hellwig
      Add a mode where XFS never overwrites existing blocks in place.  This
      is to aid debugging our COW code, and also put infrastructure in place
      for things like possible future support for zoned block devices, which
      can't support overwrites.
      
      This mode is enabled globally by doing a:
      
          echo 1 > /sys/fs/xfs/debug/always_cow
      
      Note that the parameter is global to allow running all tests in xfstests
      easily in this mode, which would not easily be possible with a per-fs
      sysfs file.
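
      Internally the mode boils down to two small predicates; a sketch of
      the helpers added to xfs_reflink.h:

      static inline bool xfs_is_always_cow_inode(struct xfs_inode *ip)
      {
              /* the global debug knob only applies to reflink filesystems */
              return ip->i_mount->m_always_cow &&
                     xfs_sb_version_hasreflink(&ip->i_mount->m_sb);
      }

      static inline bool xfs_is_cow_inode(struct xfs_inode *ip)
      {
              /* the write paths now ask this instead of reflink-only */
              return xfs_is_reflink_inode(ip) ||
                     xfs_is_always_cow_inode(ip);
      }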
      
      In always_cow mode persistent preallocations are disabled, and fallocate
      will fail when called with a 0 mode (with or without
      FALLOC_FL_KEEP_SIZE), and will not create unwritten extents for zeroed
      space when called with FALLOC_FL_ZERO_RANGE or FALLOC_FL_UNSHARE_RANGE.
      
      There are a few interesting xfstests failures when run in always_cow
      mode:
      
       - generic/392 fails because the bytes used in the file used to test
         hole punch recovery are less after the log replay.  This is
         because the blocks written and then punched out are only freed
         with a delay due to the logging mechanism.
       - xfs/170 will fail as the already fragile file streams mechanism
         doesn't seem to interact well with the COW allocator
       - xfs/180 xfs/182 xfs/192 xfs/198 xfs/204 and xfs/208 will claim
         the file system is badly fragmented, but there is not much we
         can do to avoid that when always writing out of place
       - xfs/205 fails because overwriting a file in always_cow mode
         will require new space allocation and the assumptions in the
         test thus don't work anymore.
       - xfs/326 fails to modify the file at all in always_cow mode after
         injecting the refcount error, leading to an unexpected md5sum
         after the remount, but that again is expected
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: also truncate holes covered by COW blocks · 12df89f2
      Authored by Christoph Hellwig
      This only matters if we want to write data through the COW fork that is
      not actually an overwrite of existing data.  Reasons for that are
      speculative COW fork allocations using the cowextsize, or a mode where
      we always write through the COW fork.  Currently both can't actually
      happen, but I plan to enable them.
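
      As a sketch of the idea in xfs_map_blocks() (simplified): the hole
      mapping synthesized for the data fork is now trimmed against the
      first COW fork block, just like a real extent, so a COW block
      covering the hole gets written through the COW fork instead of
      being skipped:

              /* data fork lookup landed in a hole or beyond EOF */
              if (imap.br_startoff > offset_fsb) {
                      imap.br_blockcount = imap.br_startoff - offset_fsb;
                      imap.br_startoff = offset_fsb;
                      imap.br_startblock = HOLESTARTBLOCK;
                      imap.br_state = XFS_EXT_NORM;
              }

              /* truncate the mapping, holes included, to the next COW extent */
              if (cow_fsb != NULLFILEOFF &&
                  cow_fsb < imap.br_startoff + imap.br_blockcount)
                      imap.br_blockcount = cow_fsb - imap.br_startoff;
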
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  19. 18 Feb 2019, 5 commits
  20. 15 Feb 2019, 2 commits