1. 29 Nov 2022 (3 commits)
    • xfs: drop write error injection is unfixable, remove it · 6e8af15c
      Dave Chinner committed
      With the changes to scan the page cache for dirty data to avoid data
      corruptions from partial write cleanup racing with other page cache
      operations, the drop writes error injection no longer works the same
      way it used to and causes xfs/196 to fail. This is because xfs/196
      writes to the file and populates the page cache before it turns on
      the error injection and starts failing -overwrites-.
      
      The result is that the original drop-writes code failed writes only
      -after- overwriting the data in the cache; it then invalidated the
      cached data and punched out the delalloc extent from under that
      data.
      
      On the surface, this looks fine. The problem is that page cache
      invalidation *doesn't guarantee that it removes anything from the
      page cache* and it doesn't change the dirty state of the folio. When
      block size == page size and we do page aligned IO (as xfs/196 does)
      everything happens to align perfectly and page cache invalidation
      removes the single page folios that span the written data. Hence the
      followup delalloc punch pass does not find cached data over that
      range and it can punch the extent out.
      
      IOWs, xfs/196 "works" for block size == page size with the new
      code. I say "works", because it actually only works for the case
      where IO is page aligned, and no data was read from disk before
      writes occur. Because the moment we actually read data first, the
      readahead code allocates multipage folios and suddenly the
      invalidate code goes back to zeroing subfolio ranges without
      changing dirty state.
      
      Hence, with multipage folios in play, block size == page size is
      functionally identical to block size < page size behaviour, and
      drop-writes is manifestly broken w.r.t. this case. Invalidation of
      a subfolio range doesn't result in the folio being removed from the
      cache, just the range gets zeroed. Hence after we've sequentially
      walked over a folio that we've dirtied (via write data) and then
      invalidated, we end up with a dirty folio full of zeroed data.
      
      And because the new code skips punching ranges that have dirty
      folios covering them, we end up leaving the delalloc range intact
      after failing all the writes. Hence failed writes now end up
      writing zeroes to disk in the cases where invalidation zeroes folios
      rather than removing them from cache.
      
      This is a fundamental change of behaviour that is needed to avoid
      the data corruption vectors that exist in the old write fail path,
      and it renders the drop-writes injection non-functional and
      unworkable as it stands.
      
      As it is, I think the error injection is also now unnecessary, as
      partial writes that need delalloc extent removal are going to be a lot more
      common with stale iomap detection in place. Hence this patch removes
      the drop-writes error injection completely. xfs/196 can remain for
      testing kernels that don't have this data corruption fix, but those
      that do will report:
      
      xfs/196 3s ... [not run] XFS error injection drop_writes unknown on this kernel.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: use iomap_valid method to detect stale cached iomaps · 304a68b9
      Dave Chinner committed
      Now that iomap supports a mechanism to validate cached iomaps for
      buffered write operations, hook it up to the XFS buffered write ops
      so that we can avoid data corruptions that result from stale cached
      iomaps. See:
      
      https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
      
      or the ->iomap_valid() introduction commit for exact details of the
      corruption vector.
      
      The validity cookie we store in the iomap is based on the type of
      iomap we return. It is expected that the iomap->flags we set in
      xfs_bmbt_to_iomap() are not perturbed by the iomap core and are
      returned to us in the iomap passed via the .iomap_valid() callback.
      This ensures that the validity cookie is always checking the correct
      inode fork sequence numbers to detect potential changes that affect
      the extent cached by the iomap.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: xfs_bmap_punch_delalloc_range() should take a byte range · 7348b322
      Dave Chinner committed
      All the callers of xfs_bmap_punch_delalloc_range() jump through
      hoops to convert a byte range to filesystem blocks before calling
      xfs_bmap_punch_delalloc_range(). Instead, pass the byte range to
      xfs_bmap_punch_delalloc_range() and have it do the conversion to
      filesystem blocks internally.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  2. 23 Nov 2022 (3 commits)
    • xfs,iomap: move delalloc punching to iomap · 9c7babf9
      Dave Chinner committed
      Because that's what Christoph wants for this error handling path
      only XFS uses.
      
      It requires a new iomap export for handling errors over delalloc
      ranges. This is basically the XFS code as it stands, but even though
      Christoph wants this as iomap functionality, we still have
      to call it from the filesystem specific ->iomap_end callback, and
      call into the iomap code with yet another filesystem specific
      callback to punch the delalloc extent within the defined ranges.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: use byte ranges for write cleanup ranges · b71f889c
      Dave Chinner committed
      xfs_buffered_write_iomap_end() currently converts the byte ranges
      passed to it to filesystem blocks to pass them to the bmap code to
      punch out delalloc blocks, but then has to convert filesystem
      blocks back to byte ranges for page cache truncate.
      
      We're about to make the page cache truncate go away and replace it
      with a page cache walk, so having to convert everything to/from
      filesystem blocks is messy and error-prone. It is much easier to
      pass around byte ranges and convert to page indexes and/or
      filesystem blocks only where those units are needed.
      
      In preparation for the page cache walk being added, add a helper
      that converts byte ranges to filesystem blocks and calls
      xfs_bmap_punch_delalloc_range() and convert
      xfs_buffered_write_iomap_end() to calculate limits in byte ranges.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    • xfs: punching delalloc extents on write failure is racy · 198dd8ae
      Dave Chinner committed
      xfs_buffered_write_iomap_end() has a comment about the safety of
      punching delalloc extents based on holding the IOLOCK_EXCL. This
      comment is wrong, and punching delalloc extents is not race free.
      
      When we punch out a delalloc extent after a write failure in
      xfs_buffered_write_iomap_end(), we punch out the page cache with
      truncate_pagecache_range() before we punch out the delalloc extents.
      At this point, we only hold the IOLOCK_EXCL, so there is nothing
      stopping mmap() write faults racing with this cleanup operation,
      reinstantiating a folio over the range we are about to punch and
      hence requiring the delalloc extent to be kept.
      
      If this race condition is hit, we can end up with a dirty page in
      the page cache that has no delalloc extent or space reservation
      backing it. This leads to bad things happening at writeback time.
      
      To avoid this race condition, we need the page cache truncation to
      be atomic w.r.t. the extent manipulation. We can do this by holding
      the mapping->invalidate_lock exclusively across this operation -
      this will prevent new pages from being inserted into the page cache
      whilst we are removing the pages and the backing extent and space
      reservation.
      
      Taking the mapping->invalidate_lock exclusively in the buffered
      write IO path is safe - it naturally nests inside the IOLOCK (see
      truncate and fallocate paths). iomap_zero_range() can be called from
      under the mapping->invalidate_lock (from the truncate path via
      either xfs_zero_eof() or xfs_truncate_page()), but iomap_zero_iter()
      will not instantiate new delalloc pages (because it skips holes) and
      hence will not ever need to punch out delalloc extents on failure.
      
      Fix the locking issue, and clean up the code logic a little to avoid
      unnecessary work if we didn't allocate the delalloc extent or wrote
      the entire region we allocated.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  3. 07 Nov 2022 (1 commit)
    • xfs: write page faults in iomap are not buffered writes · 118e021b
      Dave Chinner committed
      When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
      we mark the iomap as IOMAP_F_NEW so that the write context
      understands that it allocated the delalloc region.
      
      If we then fail that buffered write, xfs_buffered_write_iomap_end()
      checks for the IOMAP_F_NEW flag and if it is set, it punches out
      the unused delalloc region that was allocated for the write.
      
      The assumption this code makes is that all buffered write operations
      that can allocate space are run under an exclusive lock (i_rwsem).
      This is an invalid assumption: page faults in mmap()d regions call
      through this same function pair to map the file range being faulted
      and this runs only holding the inode->i_mapping->invalidate_lock in
      shared mode.
      
      IOWs, we can have races between page faults and write() calls that
      fail the nested page cache write operation that result in data loss.
      That is, the failing iomap_end call will punch out the data that
      the other racing iomap iteration brought into the page cache. This
      can be reproduced with generic/34[46] if we arbitrarily fail page
      cache copy-in operations from write() syscalls.
      
      Code analysis tells us that the iomap_page_mkwrite() function holds
      the already instantiated and uptodate folio locked across the iomap
      mapping iterations. Hence the folio cannot be removed from memory
      whilst we are mapping the range it covers, and as such we do not
      care if the mapping changes state underneath the iomap iteration
      loop:
      
      1. if the folio is not already dirty, no writeback races are
         possible.
      2. if we allocated the mapping (delalloc or unwritten), the folio
         cannot already be dirty. See #1.
      3. If the folio is already dirty, it must be up to date. As we hold
         it locked, it cannot be reclaimed from memory. Hence we always
         have valid data in the page cache while iterating the mapping.
      4. Valid data in the page cache can exist when the underlying
         mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
         change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
         change the data in the page - it only affects actions if we are
         initialising a new page. Hence #3 applies and we don't care
         about these extent map transitions racing with
         iomap_page_mkwrite().
      5. iomap_page_mkwrite() checks for page invalidation races
         (truncate, hole punch, etc) after it locks the folio. We also
         hold the mapping->invalidate_lock here, and hence the mapping
         cannot change due to extent removal operations while we are
         iterating the folio.
      
      As such, filesystems that don't use bufferheads will never fail
      the iomap_folio_mkwrite_iter() operation on the current mapping,
      regardless of whether the iomap should be considered stale.
      
      Further, the range we are asked to iterate is limited to the range
      inside EOF that the folio spans. Hence, for XFS, we will only map
      the exact range we are asked for, and we will only do speculative
      preallocation with delalloc if we are mapping a hole at the EOF
      page. The iterator will consume the entire range of the folio that
      is within EOF, and anything beyond the EOF block cannot be accessed.
      We never need to truncate this post-EOF speculative prealloc away in
      the context of the iomap_page_mkwrite() iterator because if it
      remains unused we'll remove it when the last reference to the inode
      goes away.
      
      Hence we don't actually need an .iomap_end() cleanup/error handling
      path at all for iomap_page_mkwrite() for XFS. This means we can
      separate the page fault processing from the complexity of the
      .iomap_end() processing in the buffered write path. This also means
      that the buffered write path will also be able to take the
      mapping->invalidate_lock as necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  4. 25 Jul 2022 (2 commits)
  5. 18 Jul 2022 (1 commit)
  6. 13 Jul 2022 (1 commit)
  7. 10 Jul 2022 (2 commits)
    • xfs: make inode attribute forks a permanent part of struct xfs_inode · 2ed5b09b
      Darrick J. Wong committed
      Syzkaller reported a UAF bug a while back:
      
      ==================================================================
      BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
      Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958
      
      CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted
      5.15.0-0.30.3-20220406_1406 #3
      Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29
      04/01/2014
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106
       print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459
       xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127
       xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159
       xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36
       __vfs_getxattr+0xdf/0x13d fs/xattr.c:399
       cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300
       security_inode_need_killpriv+0x4c/0x97 security/security.c:1408
       dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912
       dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908
       do_truncate+0xc3/0x1e0 fs/open.c:56
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      RIP: 0033:0x7f7ef4bb753d
      Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48
      89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73
      01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
      RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d
      RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0
      RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e
      R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0
       </TASK>
      
      Allocated by task 2953:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:254 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3213 [inline]
       slab_alloc mm/slub.c:3221 [inline]
       kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226
       kmem_cache_zalloc include/linux/slab.h:711 [inline]
       xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287
       xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098
       xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_setxattr+0x11b/0x177 fs/xattr.c:180
       __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214
       __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275
       vfs_setxattr+0x154/0x33d fs/xattr.c:301
       setxattr+0x216/0x29f fs/xattr.c:575
       __do_sys_fsetxattr fs/xattr.c:632 [inline]
       __se_sys_fsetxattr fs/xattr.c:621 [inline]
       __x64_sys_fsetxattr+0x243/0x2fe fs/xattr.c:621
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      Freed by task 2949:
       kasan_save_stack+0x19/0x38 mm/kasan/common.c:38
       kasan_set_track+0x1c/0x21 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:230 [inline]
       slab_free_hook mm/slub.c:1700 [inline]
       slab_free_freelist_hook mm/slub.c:1726 [inline]
       slab_free mm/slub.c:3492 [inline]
       kmem_cache_free+0xdc/0x3ce mm/slub.c:3508
       xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773
       xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822
       xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413
       xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684
       xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802
       xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59
       __vfs_removexattr+0x106/0x16a fs/xattr.c:468
       cap_inode_killpriv+0x24/0x47 security/commoncap.c:324
       security_inode_killpriv+0x54/0xa1 security/security.c:1414
       setattr_prepare+0x1a6/0x897 fs/attr.c:146
       xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682
       xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065
       xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093
       notify_change+0xae5/0x10a1 fs/attr.c:410
       do_truncate+0x134/0x1e0 fs/open.c:64
       handle_truncate fs/namei.c:3084 [inline]
       do_open fs/namei.c:3432 [inline]
       path_openat+0x30ab/0x396d fs/namei.c:3561
       do_filp_open+0x1c4/0x290 fs/namei.c:3588
       do_sys_openat2+0x60d/0x98c fs/open.c:1212
       do_sys_open+0xcf/0x13c fs/open.c:1228
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0x0
      
      The buggy address belongs to the object at ffff88802cec9188
       which belongs to the cache xfs_ifork of size 40
      The buggy address is located 20 bytes inside of
       40-byte region [ffff88802cec9188, ffff88802cec91b0)
      The buggy address belongs to the page:
      page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000
      index:0x0 pfn:0x2cec9
      flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80
      raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb
       ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc
      >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb
                                  ^
       ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb
       ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb
      ==================================================================
      
      The root cause of this bug is the unlocked access to xfs_inode.i_afp
      from the getxattr code paths while trying to determine which ILOCK mode
      to use to stabilize the xattr data.  Unfortunately, the VFS does not
      acquire i_rwsem when vfs_getxattr (or listxattr) call into the
      filesystem, which means that getxattr can race with a removexattr that's
      tearing down the attr fork and crash:
      
      xfs_attr_set:                          xfs_attr_get:
      xfs_attr_fork_remove:                  xfs_ilock_attr_map_shared:
      
      xfs_idestroy_fork(ip->i_afp);
      kmem_cache_free(xfs_ifork_cache, ip->i_afp);
      
                                             if (ip->i_afp &&
      
      ip->i_afp = NULL;
      
                                                 xfs_need_iread_extents(ip->i_afp))
                                             <KABOOM>
      
      ip->i_forkoff = 0;
      
      Regrettably, the VFS is much more lax about i_rwsem and getxattr than
      is immediately obvious -- not only does it not guarantee that we hold
      i_rwsem, it actually doesn't guarantee that we *don't* hold it either.
      The getxattr system call won't acquire the lock before calling XFS, but
      the file capabilities code calls getxattr with and without i_rwsem held
      to determine if the "security.capabilities" xattr is set on the file.
      
      Fixing the VFS locking requires a treewide investigation into every code
      path that could touch an xattr and what i_rwsem state it expects or sets
      up.  That could take years or even prove impossible; fortunately, we
      can fix this UAF problem inside XFS.
      
      An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to
      ensure that i_forkoff is always zeroed before i_afp is set to null and
      changed the read paths to use smp_rmb before accessing i_forkoff and
      i_afp, which avoided these UAF problems.  However, the patch author was
      too busy dealing with other problems in the meantime, and by the time he
      came back to this issue, the situation had changed a bit.
      
      On a modern system with selinux, each inode will always have at least
      one xattr for the selinux label, so it doesn't make much sense to keep
      incurring the extra pointer dereference.  Furthermore, Allison's
      upcoming parent pointer patchset will also cause nearly every inode in
      the filesystem to have extended attributes.  Therefore, make the inode
      attribute fork structure part of struct xfs_inode, at a cost of 40 more
      bytes.
      
      This patch adds a clunky if_present field where necessary to maintain
      the existing logic of xattr fork null pointer testing in the existing
      codebase.  The next patch switches the logic over to XFS_IFORK_Q and it
      all goes away.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
    • xfs: convert XFS_IFORK_PTR to a static inline helper · 732436ef
      Darrick J. Wong committed
      We're about to make this logic do a bit more, so convert the macro to a
      static inline function for better typechecking and fewer shouty macros.
      No functional changes here.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  8. 13 Apr 2022 (1 commit)
  9. 11 Apr 2022 (1 commit)
  10. 05 Dec 2021 (6 commits)
  11. 24 Aug 2021 (1 commit)
    • xfs: only set IOMAP_F_SHARED when providing a srcmap to a write · 72a048c1
      Darrick J. Wong committed
      While prototyping a free space defragmentation tool, I observed an
      unexpected IO error while running a sequence of commands that can be
      recreated by the following sequence of commands:
      
      # xfs_io -f -c "pwrite -S 0x58 -b 10m 0 10m" file1
      # cp --reflink=always file1 file2
      # punch-alternating -o 1 file2
      # xfs_io -c "funshare 0 10m" file2
      fallocate: Input/output error
      
      I then scraped this (abbreviated) stack trace from dmesg:
      
      WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
      CPU: 0 PID: 30788 Comm: xfs_io Not tainted 5.14.0-rc6-xfsx #rc6 5ef57b62a900814b3e4d885c755e9014541c8732
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:iomap_write_begin+0x376/0x450
      RSP: 0018:ffffc90000c0fc20 EFLAGS: 00010297
      RAX: 0000000000000001 RBX: ffffc90000c0fd10 RCX: 0000000000001000
      RDX: ffffc90000c0fc54 RSI: 000000000000000c RDI: 000000000000000c
      RBP: ffff888005d5dbd8 R08: 0000000000102000 R09: ffffc90000c0fc50
      R10: 0000000000b00000 R11: 0000000000101000 R12: ffffea0000336c40
      R13: 0000000000001000 R14: ffffc90000c0fd10 R15: 0000000000101000
      FS:  00007f4b8f62fe40(0000) GS:ffff88803ec00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000056361c554108 CR3: 000000000524e004 CR4: 00000000001706f0
      Call Trace:
       iomap_unshare_actor+0x95/0x140
       iomap_apply+0xfa/0x300
       iomap_file_unshare+0x44/0x60
       xfs_reflink_unshare+0x50/0x140 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
       xfs_file_fallocate+0x27c/0x610 [xfs 61947ea9b3a73e79d747dbc1b90205e7987e4195]
       vfs_fallocate+0x133/0x330
       __x64_sys_fallocate+0x3e/0x70
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f4b8f79140a
      
      Looking at the iomap tracepoints, I saw this:
      
      iomap_iter:           dev 8:64 ino 0x100 pos 0 length 0 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
      iomap_iter_dstmap:    dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 0 length 131072 type DELALLOC flags SHARED
      iomap_iter_srcmap:    dev 8:64 ino 0x100 bdev 8:64 addr 147456 offset 0 length 4096 type MAPPED flags
      iomap_iter:           dev 8:64 ino 0x100 pos 0 length 4096 flags WRITE|0x80 (0x81) ops xfs_buffered_write_iomap_ops caller iomap_file_unshare
      iomap_iter_dstmap:    dev 8:64 ino 0x100 bdev 8:64 addr -1 offset 4096 length 4096 type DELALLOC flags SHARED
      console:              WARNING: CPU: 0 PID: 30788 at fs/iomap/buffered-io.c:577 iomap_write_begin+0x376/0x450
      
      The first time funshare calls ->iomap_begin, xfs sees that the first
      block is shared and creates a 128k delalloc reservation in the COW fork.
      The delalloc reservation is returned as dstmap, and the shared block is
      returned as srcmap.  So far so good.
      
      funshare calls ->iomap_begin to try the second block.  This time there's
      no srcmap (punch-alternating punched it out!) but we still have the
      delalloc reservation in the COW fork.  Therefore, we again return the
      reservation as dstmap and the hole as srcmap.  iomap_unshare_iter
      incorrectly tries to unshare the hole, which __iomap_write_begin rejects
      because shared regions must be fully written and therefore cannot
      require zeroing.
      
      Therefore, change the buffered write iomap_begin function not to set
      IOMAP_F_SHARED when there isn't a source mapping to read from for the
      unsharing.
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
  12. 20 Aug 2021 (2 commits)
  13. 27 May 2021 (1 commit)
    • xfs: Fix fall-through warnings for Clang · 53004ee7
      Gustavo A. R. Silva committed
      In preparation to enable -Wimplicit-fallthrough for Clang, fix
      the following warnings by replacing /* fall through */ comments,
      and its variants, with the new pseudo-keyword macro fallthrough:
      
      fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      
      Notice that Clang doesn't recognize /* fall through */ comments as
      implicit fall-through markings, so in order to globally enable
      -Wimplicit-fallthrough for Clang, these comments need to be
      replaced with fallthrough; in the whole codebase.
      
      Link: https://github.com/KSPP/linux/issues/115
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
  14. 16 Apr 2021 (2 commits)
  15. 08 Apr 2021 (2 commits)
  16. 11 Feb 2021 (1 commit)
    • B
      xfs: restore shutdown check in mapped write fault path · e4826691
      Authored by Brian Foster
      XFS triggers an iomap warning in the write fault path due to a
      !PageUptodate() page if a write fault happens to occur on a page
      that recently failed writeback. The iomap writeback error handling
      code can clear the Uptodate flag if no portion of the page is
      submitted for I/O. This is reproduced by fstests generic/019, which
      combines various forms of I/O with simulated disk failures that
      inevitably lead to filesystem shutdown (which then unconditionally
      fails page writeback).
      
      This is a regression introduced by commit f150b423 ("xfs: split
      the iomap ops for buffered vs direct writes") due to the removal of
      a shutdown check and explicit error return in the ->iomap_begin()
      path used by the write fault path. The explicit error return
      historically translated to a SIGBUS, but now carries on with iomap
      processing where it complains about the unexpected state. Restore
      the shutdown check to xfs_buffered_write_iomap_begin() to restore
      historical behavior.
      
      Fixes: f150b423 ("xfs: split the iomap ops for buffered vs direct writes")
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  17. 04 Feb 2021, 6 commits
  18. 02 Feb 2021, 1 commit
    • D
      xfs: reduce exclusive locking on unaligned dio · ed1128c2
      Authored by Dave Chinner
      Attempt shared locking for unaligned DIO, but only if the
      underlying extent is already allocated and in written state. On
      failure, retry with the existing exclusive locking.
      
      Test case is fio randrw of 512 byte IOs using AIO and an iodepth of
      32 IOs.
      
      Vanilla:
      
        READ: bw=4560KiB/s (4670kB/s), 4560KiB/s-4560KiB/s (4670kB/s-4670kB/s), io=134MiB (140MB), run=30001-30001msec
        WRITE: bw=4567KiB/s (4676kB/s), 4567KiB/s-4567KiB/s (4676kB/s-4676kB/s), io=134MiB (140MB), run=30001-30001msec
      
      Patched:
         READ: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1127MiB (1182MB), run=30002-30002msec
        WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1128MiB (1183MB), run=30002-30002msec
      
      That's an improvement from ~18k IOPS to ~150k IOPS, which is
      about the IOPS limit of the VM block device setup I'm testing on.
      
      4kB block IO comparison:
      
         READ: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8868MiB (9299MB), run=30002-30002msec
        WRITE: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8878MiB (9309MB), run=30002-30002msec
      
      Which is ~150k IOPS, same as what the test gets for sub-block
      AIO+DIO writes with this patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [hch: rebased, split unaligned from nowait]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
  19. 23 Jan 2021, 2 commits
  20. 20 Nov 2020, 1 commit
    • D
      xfs: don't allow NOWAIT DIO across extent boundaries · 883a790a
      Authored by Dave Chinner
      Jens has reported a situation where partial direct IOs can be issued
      and completed, yet still return -EAGAIN. We don't want to report
      a short IO here; XFS should either complete a user's DIO entirely
      or not at all.
      
      This partial IO situation can occur on a write IO that is split
      across an allocated extent and a hole, and the second mapping is
      returning EAGAIN because allocation would be required.
      
      The trivial reproducer:
      
      $ sudo xfs_io -fdt -c "pwrite 0 4k" -c "pwrite -V 1 -b 8k -N 0 8k" /mnt/scr/foo
      wrote 4096/4096 bytes at offset 0
      4 KiB, 1 ops; 0.0001 sec (27.509 MiB/sec and 7042.2535 ops/sec)
      pwrite: Resource temporarily unavailable
      $
      
      The pwritev2(0, 8kB, RWF_NOWAIT) call returns EAGAIN having done
      the first 4kB write:
      
       xfs_file_direct_write: dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 0x2000
       iomap_apply:          dev 259:1 ino 0x83 pos 0 length 8192 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
       xfs_iomap_found:      dev 259:1 ino 0x83 size 0x1000 offset 0x0 count 8192 fork data startoff 0x0 startblock 24 blockcount 0x1
       iomap_apply_dstmap:   dev 259:1 ino 0x83 bdev 259:1 addr 102400 offset 0 length 4096 type MAPPED flags DIRTY
      
      Here the first iomap loop has mapped the first 4kB of the file and
      issued the IO, and we enter the second iomap_apply loop:
      
       iomap_apply: dev 259:1 ino 0x83 pos 4096 length 4096 flags WRITE|DIRECT|NOWAIT (0x31) ops xfs_direct_write_iomap_ops caller iomap_dio_rw actor iomap_dio_actor
       xfs_ilock_nowait:     dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_for_iomap
       xfs_iunlock:          dev 259:1 ino 0x83 flags ILOCK_SHARED caller xfs_direct_write_iomap_begin
      
      And we exit with -EAGAIN out because we hit the allocate case trying
      to make the second 4kB block.
      
      Then IO completes on the first 4kB and the original IO context
      completes and unlocks the inode, returning -EAGAIN to userspace:
      
       xfs_end_io_direct_write: dev 259:1 ino 0x83 isize 0x1000 disize 0x1000 offset 0x0 count 4096
       xfs_iunlock:          dev 259:1 ino 0x83 flags IOLOCK_SHARED caller xfs_file_dio_aio_write
      
      There are other vectors to the same problem when we re-enter the
      mapping code if we have to make multiple mappings under NOWAIT
      conditions. e.g. failing trylocks, COW extents being found,
      allocation being required, and so on.
      
      Avoid all these potential problems by only allowing IOMAP_NOWAIT IO
      to go ahead if the mapping we retrieve for the IO spans an entire
      allocated extent. This avoids the possibility of subsequent mappings
      to complete the IO from triggering NOWAIT semantics by any means as
      NOWAIT IO will now only enter the mapping code once per NOWAIT IO.
      Reported-and-tested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>