1. 29 March 2023 (1 commit)
    • xfs, iomap: limit individual ioend chain lengths in writeback · c5883137
      Authored by Dave Chinner
      mainline inclusion
      from mainline-v5.17-rc3
      commit ebb7fb15
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ebb7fb1557b1d03b906b668aa2164b51e6b7d19a
      
      --------------------------------
      
      Trond Myklebust reported soft lockups in XFS IO completion such as
      this:
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
       CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
       Workqueue: xfs-conv/md127 xfs_end_io [xfs]
       RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
       Call Trace:
        wake_up_page_bit+0x8a/0x110
        iomap_finish_ioend+0xd7/0x1c0
        iomap_finish_ioends+0x7f/0xb0
        xfs_end_ioend+0x6b/0x100 [xfs]
        xfs_end_io+0xb9/0xe0 [xfs]
        process_one_work+0x1a7/0x360
        worker_thread+0x1fa/0x390
        kthread+0x116/0x130
        ret_from_fork+0x35/0x40
      
      Ioends are processed as an atomic completion unit when all the
      chained bios in the ioend have completed their IO. Logically
      contiguous ioends can also be merged and completed as a single,
      larger unit.  Both of these things can be problematic because the
      bio chain length per ioend and the size of the merged ioends
      processed as a single completion are unbound.
      
      If we have a large sequential dirty region in the page cache,
      write_cache_pages() will keep feeding us sequential pages and we
      will keep mapping them into ioends and bios until we get a dirty
      page at a non-sequential file offset. These large sequential runs
      will result in bio and ioend chaining to optimise the IO
      patterns. The pages under writeback are pinned within these chains
      until the submission chaining is broken, allowing the entire chain
      to be completed. This can result in huge chains being processed
      in IO completion context.
      
      We get deep bio chaining if we have large contiguous physical
      extents. We will keep adding pages to the current bio until it is
      full, then we'll chain a new bio to keep adding pages for writeback.
      Hence we can build bio chains that map millions of pages and tens of
      gigabytes of RAM if the page cache contains big enough contiguous
      dirty file regions. This long bio chain pins those pages until the
      final bio in the chain completes and the ioend can iterate all the
      chained bios and complete them.
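
      As a rough illustration, the submission side chains a full bio to a
      freshly allocated one along these lines (a sketch modelled on the
      mainline iomap_chain_bio() of that era; exact helpers vary by
      kernel version):

      /* Sketch only: chain a new bio once the current one is full. */
      static struct bio *
      iomap_chain_bio(struct bio *prev)
      {
          struct bio *new;

          new = bio_alloc(GFP_NOFS, BIO_MAX_PAGES);
          bio_copy_dev(new, prev);    /* also copies over blkcg info */
          new->bi_iter.bi_sector = bio_end_sector(prev);
          new->bi_opf = prev->bi_opf;

          bio_chain(prev, new);       /* prev completes only after new */
          bio_get(prev);              /* held for iomap_finish_ioend() */
          submit_bio(prev);
          return new;
      }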
      
      OTOH, if we have a physically fragmented file, we end up submitting
      one ioend per physical fragment that each have a small bio or bio
      chain attached to them. We do not chain these at IO submission time,
      but instead we chain them at completion time based on file
      offset via iomap_ioend_try_merge(). Hence we can end up with unbound
      ioend chains being built via completion merging.
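
      For reference, the completion-side merging is driven by a loop like
      this (a sketch modelled on the mainline xfs_end_io() of that era;
      the merge_private callback argument is elided for brevity):

      /* Sketch only: merge and complete ioends from the per-inode list. */
      static void
      xfs_end_io(struct work_struct *work)
      {
          struct xfs_inode *ip =
              container_of(work, struct xfs_inode, i_ioend_work);
          struct iomap_ioend *ioend;
          struct list_head tmp;
          unsigned long flags;

          spin_lock_irqsave(&ip->i_ioend_lock, flags);
          list_replace_init(&ip->i_ioend_list, &tmp);
          spin_unlock_irqrestore(&ip->i_ioend_lock, flags);

          iomap_sort_ioends(&tmp);
          while ((ioend = list_first_entry_or_null(&tmp,
                  struct iomap_ioend, io_list))) {
              list_del_init(&ioend->io_list);
              /* chain all file-contiguous ioends onto this one... */
              iomap_ioend_try_merge(ioend, &tmp, NULL);
              /* ...then process the entire chain as one completion */
              xfs_end_ioend(ioend);
          }
      }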
      
      XFS can then do COW remapping or unwritten extent conversion on that
      merged chain, which involves walking an extent fragment at a time
      and running a transaction to modify the physical extent information.
      IOWs, we merge all the discontiguous ioends together into a
      contiguous file range, only to then process them individually as
      discontiguous extents.
      
      This extent manipulation is computationally expensive and can run in
      a tight loop, so merging logically contiguous but physically
      discontiguous ioends gains us nothing except for hiding the fact
      that we broke the ioends up into individual physical extents at
      submission and then need to loop over those individual physical
      extents at completion.
      
      Hence we need to have mechanisms to limit ioend sizes and
      to break up completion processing of large merged ioend chains:
      
      1. bio chains per ioend need to be bound in length. Pure overwrites
      go straight to iomap_finish_ioend() in softirq context with the
      exact bio chain attached to the ioend by submission. Hence the only
      way to prevent long holdoffs here is to bound ioend submission
      sizes because we can't reschedule in softirq context.
      
      2. iomap_finish_ioends() has to handle unbound merged ioend chains
      correctly. This relies on any one call to iomap_finish_ioend() being
      bound in runtime so that cond_resched() can be issued regularly as
      the long ioend chain is processed. i.e. this relies on mechanism #1
      to limit individual ioend sizes to work correctly.
      
      3. filesystems have to loop over the merged ioends to process
      physical extent manipulations. This means they can loop internally,
      and so we break merging at physical extent boundaries so the
      filesystem can easily insert reschedule points between individual
      extent manipulations.
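
      To make the shape of mechanism #2 concrete: upstream,
      iomap_finish_ioends() was changed to count completed pages and
      reschedule periodically, roughly as below (IOEND_BATCH_SIZE and the
      page-count return value follow the mainline commit; this backport
      may differ in detail):

      #define IOEND_BATCH_SIZE    4096

      /* Sketch only: iomap_finish_ioend() returns pages completed. */
      void
      iomap_finish_ioends(struct iomap_ioend *ioend, int error)
      {
          struct list_head tmp;
          u32 completions;

          might_sleep();

          list_replace_init(&ioend->io_list, &tmp);
          completions = iomap_finish_ioend(ioend, error);

          while (!list_empty(&tmp)) {
              /* bound the work done between reschedule points */
              if (completions > IOEND_BATCH_SIZE * 8) {
                  cond_resched();
                  completions = 0;
              }
              ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
              list_del_init(&ioend->io_list);
              completions += iomap_finish_ioend(ioend, error);
          }
      }
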
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Conflicts:
      	include/linux/iomap.h
      	fs/iomap/buffered-io.c
      	fs/xfs/xfs_aops.c
      
      	[ 6e552494 ("iomap: remove unused private field from ioend")
      	  is not applied.
      	  95c4cd05 ("iomap: Convert to_iomap_page to take a folio") is
      	  not applied.
      	  8ffd74e9 ("iomap: Convert bio completions to use folios") is
      	  not applied.
      	  044c6449 ("xfs: drop unused ioend private merge and
      	  setfilesize code") is not applied. ]
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  2. 18 January 2023 (1 commit)
  3. 27 September 2022 (1 commit)
  4. 15 November 2021 (1 commit)
  5. 30 October 2021 (1 commit)
  6. 21 October 2021 (1 commit)
  7. 05 November 2020 (2 commits)
    • iomap: clean up writeback state logic on writepage error · 50e7d6c7
      Authored by Brian Foster
      The iomap writepage error handling logic is a mash of old and
      slightly broken XFS writepage logic. When keepwrite writeback state
      tracking was introduced in XFS in commit 0d085a52 ("xfs: ensure
      WB_SYNC_ALL writeback handles partial pages correctly"), XFS had an
      additional cluster writeback context that scanned ahead of
      ->writepage() to process dirty pages over the current ->writepage()
      extent mapping. This context expected a dirty page and required
      retention of the TOWRITE tag on partial page processing so the
      higher level writeback context would revisit the page (in contrast
      to ->writepage(), which passes a page with the dirty bit already
      cleared).
      
      The cluster writeback mechanism was eventually removed and some of
      the error handling logic folded into the primary writeback path in
      commit 150d5be0 ("xfs: remove xfs_cancel_ioend"). This patch
      accidentally conflated the two contexts by using the keepwrite logic
      in ->writepage() without accounting for the fact that the page is
      not dirty. Further, the keepwrite logic has no practical effect on
      the core ->writepage() caller (write_cache_pages()) because it never
      revisits a page in the current function invocation.
      
      Technically, the page should be redirtied for the keepwrite logic to
      have any effect. Otherwise, write_cache_pages() may find the tagged
      page but will skip it since it is clean. Even if the page was
      redirtied, however, there is still no practical effect to keepwrite
      since write_cache_pages() does not wrap around within a single
      invocation of the function. Therefore, the dirty page would simply
      end up retagged on the next writeback sequence over the associated
      range.
      
      All that being said, none of this really matters because redirtying
      a partially processed page introduces a potential infinite redirty
      -> writeback failure loop that deviates from the current design
      principle of clearing the dirty state on writepage failure to avoid
      building up too much dirty, unreclaimable memory on the system.
      Therefore, drop the spurious keepwrite usage and dirty state
      clearing logic from iomap_writepage_map(), treat the partially
      processed page the same as a fully processed page, and let the
      imminent ioend failure clean up the writeback state.
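
      With that, the error path in iomap_writepage_map() reduces to
      something like the following (a sketch of the tail of that function
      after this patch, per mainline at the time):

          if (unlikely(error)) {
              /*
               * Tell the filesystem what portion of the current page
               * failed to map.  If the page was never added to an ioend,
               * IO completion will not touch it, so unlock it here.
               */
              if (wpc->ops->discard_page)
                  wpc->ops->discard_page(page, file_offset);
              if (!count) {
                  ClearPageUptodate(page);
                  unlock_page(page);
                  goto done;
              }
          }

          /*
           * No keepwrite, no redirtying: a partially processed page is
           * treated like a fully processed one, and the failing ioend
           * cleans up the writeback state.
           */
          set_page_writeback(page);
          unlock_page(page);
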
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • iomap: support partial page discard on writeback block mapping failure · 763e4cdc
      Authored by Brian Foster
      iomap writeback mapping failure only calls into ->discard_page() if
      the current page has not been added to the ioend. Accordingly, the
      XFS callback assumes a full page discard and invalidation. This is
      problematic for sub-page block size filesystems where some portion
      of a page might have been mapped successfully before a failure to
      map a delalloc block occurs. ->discard_page() is not called in that
      error scenario and the bio is explicitly failed by iomap via the
      error return from ->prepare_ioend(). As a result, the filesystem
      leaks delalloc blocks and corrupts the filesystem block counters.
      
      Since XFS is the only user of ->discard_page(), tweak the semantics
      to invoke the callback unconditionally on mapping errors and provide
      the file offset that failed to map. Update xfs_discard_page() to
      discard the corresponding portion of the file and pass the range
      along to iomap_invalidatepage(). The latter already properly handles
      both full and sub-page scenarios by not changing any iomap or page
      state on sub-page invalidations.
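
      A sketch of the updated XFS callback (modelled on the mainline
      xfs_discard_page() after this change; identifiers may differ across
      kernel versions):

      static void
      xfs_discard_page(struct page *page, loff_t fileoff)
      {
          struct inode *inode = page->mapping->host;
          struct xfs_inode *ip = XFS_I(inode);
          struct xfs_mount *mp = ip->i_mount;
          unsigned int pageoff = offset_in_page(fileoff);
          xfs_fileoff_t start_fsb = XFS_B_TO_FSBT(mp, fileoff);
          xfs_fileoff_t pageoff_fsb = XFS_B_TO_FSBT(mp, pageoff);
          int error;

          if (XFS_FORCED_SHUTDOWN(mp))
              goto out_invalidate;

          xfs_alert_ratelimited(mp,
              "page discard on page "PTR_FMT", inode 0x%llx, offset %llu.",
                  page, ip->i_ino, fileoff);

          /* punch delalloc blocks from the failed offset to page end */
          error = xfs_bmap_punch_delalloc_range(ip, start_fsb,
                  i_blocks_per_page(inode, page) - pageoff_fsb);
          if (error && !XFS_FORCED_SHUTDOWN(mp))
              xfs_alert(mp, "page discard unable to remove delalloc mapping.");
      out_invalidate:
          /* invalidate only the failed sub-page range */
          iomap_invalidatepage(page, pageoff, PAGE_SIZE - pageoff);
      }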
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  8. 28 September 2020 (1 commit)
  9. 21 September 2020 (10 commits)
  10. 10 September 2020 (2 commits)
  11. 09 June 2020 (1 commit)
  12. 03 June 2020 (3 commits)
  13. 03 April 2020 (2 commits)
  14. 05 March 2020 (1 commit)
  15. 07 January 2020 (1 commit)
    • fs: Fix page_mkwrite off-by-one errors · 243145bc
      Authored by Andreas Gruenbacher
      The check in block_page_mkwrite that is meant to determine whether an
      offset is within the inode size is off by one.  This bug has been copied
      into iomap_page_mkwrite and several filesystems (ubifs, ext4, f2fs,
      ceph).
      
      Fix that by introducing a new page_mkwrite_check_truncate helper that
      checks for truncate and computes the bytes in the page up to EOF.  Use
      the helper in iomap.
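
      The helper reads roughly as follows (a sketch of the addition to
      include/linux/pagemap.h; note the explicit handling of a page that
      starts exactly at EOF, which is the case the old
      "page_offset(page) > size" style check got wrong):

      static inline ssize_t page_mkwrite_check_truncate(struct page *page,
                            struct inode *inode)
      {
          loff_t size = i_size_read(inode);
          pgoff_t index = size >> PAGE_SHIFT;
          int offset = offset_in_page(size);

          if (page->mapping != inode->i_mapping)
              return -EFAULT;

          /* page is wholly inside EOF */
          if (page->index < index)
              return PAGE_SIZE;
          /* page is wholly past EOF */
          if (page->index > index || (page->index == index && offset == 0))
              return -EFAULT;
          /* page is partially inside EOF */
          return offset;
      }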
      
      NOTE from Darrick: The original patch fixed a number of filesystems, but
      then there were merge conflicts with the f2fs for-next tree; a
      subsequent re-submission of the patch had different btrfs changes with
      no explanation; and Christoph complained that each per-fs fix should be
      a separate patch.  In my view that's too much risk to take on, so I
      decided to drop all the hunks except for iomap, since I've actually QA'd
      XFS.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      [darrick: drop everything but the iomap parts]
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  16. 05 December 2019 (2 commits)
    • iomap: stop using ioend after it's been freed in iomap_finish_ioend() · c275779f
      Authored by Zorro Lang
      This patch fixes the following KASAN report. The @ioend has been
      freed by bio_put(), but iomap_finish_ioend() still tries to access
      its data.
      
      [20563.631624] BUG: KASAN: use-after-free in iomap_finish_ioend+0x58c/0x5c0
      [20563.638319] Read of size 8 at addr fffffc0c54a36928 by task kworker/123:2/22184
      
      [20563.647107] CPU: 123 PID: 22184 Comm: kworker/123:2 Not tainted 5.4.0+ #1
      [20563.653887] Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
      [20563.664499] Workqueue: xfs-conv/sda5 xfs_end_io [xfs]
      [20563.669547] Call trace:
      [20563.671993]  dump_backtrace+0x0/0x370
      [20563.675648]  show_stack+0x1c/0x28
      [20563.678958]  dump_stack+0x138/0x1b0
      [20563.682455]  print_address_description.isra.9+0x60/0x378
      [20563.687759]  __kasan_report+0x1a4/0x2a8
      [20563.691587]  kasan_report+0xc/0x18
      [20563.694985]  __asan_report_load8_noabort+0x18/0x20
      [20563.699769]  iomap_finish_ioend+0x58c/0x5c0
      [20563.703944]  iomap_finish_ioends+0x110/0x270
      [20563.708396]  xfs_end_ioend+0x168/0x598 [xfs]
      [20563.712823]  xfs_end_io+0x1e0/0x2d0 [xfs]
      [20563.716834]  process_one_work+0x7f0/0x1ac8
      [20563.720922]  worker_thread+0x334/0xae0
      [20563.724664]  kthread+0x2c4/0x348
      [20563.727889]  ret_from_fork+0x10/0x18
      
      [20563.732941] Allocated by task 83403:
      [20563.736512]  save_stack+0x24/0xb0
      [20563.739820]  __kasan_kmalloc.isra.9+0xc4/0xe0
      [20563.744169]  kasan_slab_alloc+0x14/0x20
      [20563.747998]  slab_post_alloc_hook+0x50/0xa8
      [20563.752173]  kmem_cache_alloc+0x154/0x330
      [20563.756185]  mempool_alloc_slab+0x20/0x28
      [20563.760186]  mempool_alloc+0xf4/0x2a8
      [20563.763845]  bio_alloc_bioset+0x2d0/0x448
      [20563.767849]  iomap_writepage_map+0x4b8/0x1740
      [20563.772198]  iomap_do_writepage+0x200/0x8d0
      [20563.776380]  write_cache_pages+0x8a4/0xed8
      [20563.780469]  iomap_writepages+0x4c/0xb0
      [20563.784463]  xfs_vm_writepages+0xf8/0x148 [xfs]
      [20563.788989]  do_writepages+0xc8/0x218
      [20563.792658]  __writeback_single_inode+0x168/0x18f8
      [20563.797441]  writeback_sb_inodes+0x370/0xd30
      [20563.801703]  wb_writeback+0x2d4/0x1270
      [20563.805446]  wb_workfn+0x344/0x1178
      [20563.808928]  process_one_work+0x7f0/0x1ac8
      [20563.813016]  worker_thread+0x334/0xae0
      [20563.816757]  kthread+0x2c4/0x348
      [20563.819979]  ret_from_fork+0x10/0x18
      
      [20563.825028] Freed by task 22184:
      [20563.828251]  save_stack+0x24/0xb0
      [20563.831559]  __kasan_slab_free+0x10c/0x180
      [20563.835648]  kasan_slab_free+0x10/0x18
      [20563.839389]  slab_free_freelist_hook+0xb4/0x1c0
      [20563.843912]  kmem_cache_free+0x8c/0x3e8
      [20563.847745]  mempool_free_slab+0x20/0x28
      [20563.851660]  mempool_free+0xd4/0x2f8
      [20563.855231]  bio_free+0x33c/0x518
      [20563.858537]  bio_put+0xb8/0x100
      [20563.861672]  iomap_finish_ioend+0x168/0x5c0
      [20563.865847]  iomap_finish_ioends+0x110/0x270
      [20563.870328]  xfs_end_ioend+0x168/0x598 [xfs]
      [20563.874751]  xfs_end_io+0x1e0/0x2d0 [xfs]
      [20563.878755]  process_one_work+0x7f0/0x1ac8
      [20563.882844]  worker_thread+0x334/0xae0
      [20563.886584]  kthread+0x2c4/0x348
      [20563.889804]  ret_from_fork+0x10/0x18
      
      [20563.894855] The buggy address belongs to the object at fffffc0c54a36900
                      which belongs to the cache bio-1 of size 248
      [20563.906844] The buggy address is located 40 bytes inside of
                      248-byte region [fffffc0c54a36900, fffffc0c54a369f8)
      [20563.918485] The buggy address belongs to the page:
      [20563.923269] page:ffffffff82f528c0 refcount:1 mapcount:0 mapping:fffffc8e4ba31900 index:0xfffffc0c54a33300
      [20563.932832] raw: 17ffff8000000200 ffffffffa3060100 0000000700000007 fffffc8e4ba31900
      [20563.940567] raw: fffffc0c54a33300 0000000080aa0042 00000001ffffffff 0000000000000000
      [20563.948300] page dumped because: kasan: bad access detected
      
      [20563.955345] Memory state around the buggy address:
      [20563.960129]  fffffc0c54a36800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
      [20563.967342]  fffffc0c54a36880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [20563.974554] >fffffc0c54a36900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [20563.981766]                                   ^
      [20563.986288]  fffffc0c54a36980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc
      [20563.993501]  fffffc0c54a36a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      [20564.000713] ==================================================================
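
      The fix is to capture everything the writeback error message needs
      before the final bio_put() can free the ioend, roughly as below (a
      sketch of iomap_finish_ioend() after this patch, per mainline v5.5):

      static void
      iomap_finish_ioend(struct iomap_ioend *ioend, int error)
      {
          struct inode *inode = ioend->io_inode;
          struct bio *bio = &ioend->io_inline_bio;
          struct bio *last = ioend->io_bio, *next;
          u64 start = bio->bi_iter.bi_sector;   /* saved before freeing */
          loff_t offset = ioend->io_offset;     /* saved before freeing */
          bool quiet = bio_flagged(bio, BIO_QUIET);

          for (bio = &ioend->io_inline_bio; bio; bio = next) {
              struct bio_vec *bv;
              struct bvec_iter_all iter_all;

              /* the last bio's bi_private points back at the ioend */
              if (bio == last)
                  next = NULL;
              else
                  next = bio->bi_private;

              /* walk each page on the bio, ending page IO on them */
              bio_for_each_segment_all(bv, bio, iter_all)
                  iomap_finish_page_writeback(inode, bv->bv_page, error);
              bio_put(bio);
          }
          /* The ioend has been freed by bio_put() */

          if (unlikely(error && !quiet)) {
              printk_ratelimited(KERN_ERR
      "%s: writeback error on inode %lu, offset %lld, sector %llu",
                  inode->i_sb->s_id, inode->i_ino, offset, start);
          }
      }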
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205703
      Signed-off-by: Zorro Lang <zlang@redhat.com>
      Fixes: 9cd0ed63 ("iomap: enhance writeback error message")
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • iomap: fix sub-page uptodate handling · 1cea335d
      Authored by Christoph Hellwig
      bio completions can race when a page spans more than one file system
      block.  Add a spinlock to synchronize marking the page uptodate.
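
      The change boils down to updating the per-page uptodate bitmap under
      a new spinlock, so racing completions cannot both miss the
      all-uptodate transition. Roughly (a sketch per the mainline patch,
      pre-folio names):

      static void
      iomap_iop_set_range_uptodate(struct page *page, unsigned off,
              unsigned len)
      {
          struct iomap_page *iop = to_iomap_page(page);
          struct inode *inode = page->mapping->host;
          unsigned first = off >> inode->i_blkbits;
          unsigned last = (off + len - 1) >> inode->i_blkbits;
          bool uptodate = true;
          unsigned long flags;
          unsigned int i;

          spin_lock_irqsave(&iop->uptodate_lock, flags);
          for (i = 0; i < PAGE_SIZE / i_blocksize(inode); i++) {
              if (i >= first && i <= last)
                  set_bit(i, iop->uptodate);
              else if (!test_bit(i, iop->uptodate))
                  uptodate = false;
          }

          /* only one completion can observe the final sub-range filled */
          if (uptodate)
              SetPageUptodate(page);
          spin_unlock_irqrestore(&iop->uptodate_lock, flags);
      }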
      
      Fixes: 9dc55f13 ("iomap: add support for sub-pagesize buffered I/O without buffer heads")
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  17. 07 November 2019 (1 commit)
  18. 21 October 2019 (8 commits)