1. 07 Nov, 2022 1 commit
    • xfs: write page faults in iomap are not buffered writes · 118e021b
      Dave Chinner committed
      When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
      we mark the iomap as IOMAP_F_NEW so that the write context
      understands that it allocated the delalloc region.
      
      If we then fail that buffered write, xfs_buffered_write_iomap_end()
      checks for the IOMAP_F_NEW flag and if it is set, it punches out
      the unused delalloc region that was allocated for the write.
      
      The assumption this code makes is that all buffered write operations
      that can allocate space are run under an exclusive lock (i_rwsem).
      This is an invalid assumption: page faults in mmap()d regions call
      through this same function pair to map the file range being faulted
      and this runs only holding the inode->i_mapping->invalidate_lock in
      shared mode.
      
      IOWs, we can have races between page faults and write() calls that
      fail the nested page cache write operation that result in data loss.
      That is, the failing iomap_end call will punch out the data that
      the other racing iomap iteration brought into the page cache. This
      can be reproduced with generic/34[46] if we arbitrarily fail page
      cache copy-in operations from write() syscalls.
      
      Code analysis tells us that the iomap_page_mkwrite() function holds
      the already instantiated and uptodate folio locked across the iomap
      mapping iterations. Hence the folio cannot be removed from memory
      whilst we are mapping the range it covers, and as such we do not
      care if the mapping changes state underneath the iomap iteration
      loop:
      
      1. If the folio is not already dirty, there are no writeback races
         possible.
      2. If we allocated the mapping (delalloc or unwritten), the folio
         cannot already be dirty. See #1.
      3. If the folio is already dirty, it must be up to date. As we hold
         it locked, it cannot be reclaimed from memory. Hence we always
         have valid data in the page cache while iterating the mapping.
      4. Valid data in the page cache can exist when the underlying
         mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
         change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
         change the data in the page - it only affects actions if we are
         initialising a new page. Hence #3 applies and we don't care
         about these extent map transitions racing with
         iomap_page_mkwrite().
      5. iomap_page_mkwrite() checks for page invalidation races
         (truncate, hole punch, etc) after it locks the folio. We also
         hold the mapping->invalidate_lock here, and hence the mapping
         cannot change due to extent removal operations while we are
         iterating the folio.
      
      As such, filesystems that don't use bufferheads will never fail
      the iomap_folio_mkwrite_iter() operation on the current mapping,
      regardless of whether the iomap should be considered stale.
      
      Further, the range we are asked to iterate is limited to the range
      inside EOF that the folio spans. Hence, for XFS, we will only map
      the exact range we are asked for, and we will only do speculative
      preallocation with delalloc if we are mapping a hole at the EOF
      page. The iterator will consume the entire range of the folio that
      is within EOF, and anything beyond the EOF block cannot be accessed.
      We never need to truncate this post-EOF speculative prealloc away in
      the context of the iomap_page_mkwrite() iterator because if it
      remains unused we'll remove it when the last reference to the inode
      goes away.
      
      Hence we don't actually need an .iomap_end() cleanup/error handling
      path at all for iomap_page_mkwrite() for XFS. This means we can
      separate the page fault processing from the complexity of the
      .iomap_end() processing in the buffered write path. This also means
      that the buffered write path will also be able to take the
      mapping->invalidate_lock as necessary.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  2. 31 Oct, 2022 1 commit
    • xfs: fix incorrect return type for fsdax fault handlers · 47ba8cc7
      Darrick J. Wong committed
      The kernel robot complained about this:
      
      >> fs/xfs/xfs_file.c:1266:31: sparse: sparse: incorrect type in return expression (different base types) @@     expected int @@     got restricted vm_fault_t @@
         fs/xfs/xfs_file.c:1266:31: sparse:     expected int
         fs/xfs/xfs_file.c:1266:31: sparse:     got restricted vm_fault_t
         fs/xfs/xfs_file.c:1314:21: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted vm_fault_t [usertype] ret @@     got int @@
         fs/xfs/xfs_file.c:1314:21: sparse:     expected restricted vm_fault_t [usertype] ret
         fs/xfs/xfs_file.c:1314:21: sparse:     got int
      
      Fix the incorrect return type for these two functions.
      
      While we're at it, make the !fsdax version return VM_FAULT_SIGBUS
      because a zero return value will cause some callers to try to lock
      vmf->page, which we never set here.
      
      Fixes: ea6c49b7 ("xfs: support CoW in fsdax mode")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  3. 06 Aug, 2022 1 commit
  4. 25 Jul, 2022 1 commit
  5. 18 Jul, 2022 2 commits
  6. 22 May, 2022 1 commit
  7. 16 May, 2022 1 commit
  8. 21 Apr, 2022 2 commits
  9. 12 Apr, 2022 1 commit
  10. 02 Feb, 2022 5 commits
  11. 18 Jan, 2022 1 commit
    • xfs: kill the XFS_IOC_{ALLOC,FREE}SP* ioctls · 4d1b97f9
      Darrick J. Wong committed
      According to the glibc compat header for Irix 4, these ioctls originated
      in April 1991 as a (somewhat clunky) way to preallocate space at the end
      of a file on an EFS filesystem.  XFS, which was released in Irix 5.3 in
      December 1993, picked up these ioctls to maintain compatibility and they
      were ported to Linux in the early 2000s.
      
      Recently it was pointed out to me they still lurk in the kernel, even
      though the Linux fallocate syscall supplanted the functionality a long
      time ago.  fstests doesn't seem to include any real functional or stress
      tests for these ioctls, which means that the code quality is ... very
      questionable.  Most notably, it was a stale disk block exposure vector
      for 21 years and nobody noticed or complained.  As mature programmers
      say, "If you're not testing it, it's broken."
      
      Given all that, let's withdraw these ioctls from the XFS userspace API.
      Normally we'd set a long deprecation process, but I estimate that there
      aren't any real users, so let's trigger a warning in dmesg and return
      -ENOTTY.
      
      See: CVE-2021-4155
      
      Augments: 983d8e60 ("xfs: map unwritten blocks in XFS_IOC_{ALLOC,FREE}SP just like fallocate")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
  12. 05 Dec, 2021 1 commit
  13. 24 Oct, 2021 1 commit
    • iomap: Add done_before argument to iomap_dio_rw · 4fdccaa0
      Andreas Gruenbacher committed
      Add a done_before argument to iomap_dio_rw that indicates how much of
      the request has already been transferred.  When the request succeeds, we
      report that done_before additional bytes were transferred.  This is
      useful for finishing a request asynchronously when part of the request
      has already been completed synchronously.
      
      We'll use that to allow iomap_dio_rw to be used with page faults
      disabled: when a page fault occurs while submitting a request, we
      synchronously complete the part of the request that has already been
      submitted.  The caller can then take care of the page fault and call
      iomap_dio_rw again for the rest of the request, passing in the number of
      bytes already transferred.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
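The resume pattern described in this commit can be sketched in userspace C. This is only an illustration under stated assumptions: `struct demo_req`, `demo_dio_rw()` and `demo_write_all()` are hypothetical names, page faults are modelled as a simple "resident bytes" counter, and the real iomap_dio_rw() signature and return conventions are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request; `resident` models how much of the user buffer
 * can be accessed without taking a page fault. */
struct demo_req {
    size_t len;       /* total bytes requested */
    size_t resident;  /* bytes of the buffer currently faulted in */
};

/*
 * Hypothetical stand-in for a direct I/O call made with page faults
 * disabled. It transfers from offset `done_before` up to the first
 * non-resident byte and reports the cumulative count, i.e. the bytes
 * moved by this call plus done_before.
 */
static size_t demo_dio_rw(const struct demo_req *req, size_t done_before)
{
    size_t end = req->resident < req->len ? req->resident : req->len;
    size_t done = end > done_before ? end - done_before : 0;

    assert(done_before <= req->len);
    return done_before + done;
}

/* Caller loop: on a short transfer, fault in more of the buffer and
 * call again, passing in how much has already been transferred. */
static size_t demo_write_all(struct demo_req *req, size_t fault_in_step,
                             int *calls)
{
    size_t done = 0;

    assert(fault_in_step > 0);  /* otherwise the loop could not progress */
    *calls = 0;
    for (;;) {
        done = demo_dio_rw(req, done);
        (*calls)++;
        if (done == req->len)
            return done;
        /* "Handle the page fault": make the next chunk resident. */
        req->resident += fault_in_step;
    }
}
```

With a 10000-byte request and 4096 bytes faulted in per step, the sketch needs three calls, and the final call reports the full 10000 bytes because the earlier progress is carried in through `done_before`.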
  14. 18 Oct, 2021 1 commit
  15. 20 Aug, 2021 3 commits
  16. 13 Jul, 2021 1 commit
  17. 22 Jun, 2021 2 commits
  18. 09 Jun, 2021 2 commits
  19. 03 Jun, 2021 1 commit
    • xfs: don't take a spinlock unconditionally in the DIO fastpath · 977ec4dd
      Dave Chinner committed
      Because this happens at high thread counts on high IOPS devices
      doing mixed read/write AIO-DIO to a single file at about a million
      iops:
      
         64.09%     0.21%  [kernel]            [k] io_submit_one
         - 63.87% io_submit_one
            - 44.33% aio_write
               - 42.70% xfs_file_write_iter
                  - 41.32% xfs_file_dio_write_aligned
                     - 25.51% xfs_file_write_checks
                        - 21.60% _raw_spin_lock
                           - 21.59% do_raw_spin_lock
                              - 19.70% __pv_queued_spin_lock_slowpath
      
      This also happens on the IO completion path:
      
         22.89%     0.69%  [kernel]            [k] xfs_dio_write_end_io
         - 22.49% xfs_dio_write_end_io
            - 21.79% _raw_spin_lock
               - 20.97% do_raw_spin_lock
                  - 20.10% __pv_queued_spin_lock_slowpath
      
      IOWs, fio is burning ~14 whole CPUs on this spin lock.
      
      So, do an unlocked check against inode size first, then if we are
      at/beyond EOF, take the spinlock and recheck. This makes the
      spinlock disappear from the overwrite fastpath.
      
      I'd like to report that fixing this makes things go faster. It
      doesn't - it just exposes the XFS_ILOCK as the next severe
      contention point doing extent mapping lookups, and that now burns
      all the 14 CPUs this spinlock was burning.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
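The "unlocked check first, then lock and recheck" idea from this commit can be illustrated with a small userspace C sketch. Everything here is an assumption made for illustration: `struct demo_inode`, the `demo_*` helpers and the C11 `atomic_flag` spinlock are stand-ins, not the kernel's inode, i_size helpers or spinlock API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode; not the kernel's struct xfs_inode. */
struct demo_inode {
    _Atomic int64_t size;      /* file size, readable without the lock */
    atomic_flag size_lock;     /* minimal spinlock guarding size updates */
    unsigned long slow_paths;  /* how often the lock was taken (illustration) */
};

static void demo_inode_init(struct demo_inode *ip, int64_t size)
{
    atomic_init(&ip->size, size);
    atomic_flag_clear(&ip->size_lock);
    ip->slow_paths = 0;
}

static void demo_spin_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;  /* spin until the holder clears the flag */
}

static void demo_spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Record that a write ended at `end`; returns true if it extended the file. */
static bool demo_update_size(struct demo_inode *ip, int64_t end)
{
    bool extended = false;

    assert(end >= 0);

    /* Fast path: a write that stays inside EOF never touches the lock. */
    if (end <= atomic_load(&ip->size))
        return false;

    /* Slow path: beyond EOF, so take the lock and recheck before updating,
     * because another extender may have won the race in the meantime. */
    demo_spin_lock(&ip->size_lock);
    ip->slow_paths++;
    if (end > atomic_load(&ip->size)) {
        atomic_store(&ip->size, end);
        extended = true;
    }
    demo_spin_unlock(&ip->size_lock);
    return extended;
}
```

In this sketch an overwrite inside EOF returns without ever touching `size_lock`, which is the property the commit is after for the overwrite fastpath. The recheck under the lock keeps the size update correct when two extending writes race.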
  20. 27 May, 2021 1 commit
    • xfs: Fix fall-through warnings for Clang · 53004ee7
      Gustavo A. R. Silva committed
      In preparation for enabling -Wimplicit-fallthrough for Clang, fix
      the following warnings by replacing /* fall through */ comments,
      and their variants, with the new pseudo-keyword macro fallthrough:
      
      fs/xfs/libxfs/xfs_alloc.c:3167:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_da_btree.c:286:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_ag_resv.c:346:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/libxfs/xfs_ag_resv.c:388:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_bmap_util.c:246:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_export.c:88:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_export.c:96:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_file.c:867:3: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_ioctl.c:562:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_ioctl.c:1548:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_iomap.c:1040:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_inode.c:852:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_log.c:2627:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/xfs_trans_buf.c:298:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/bmap.c:275:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/btree.c:48:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:85:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:138:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/common.c:698:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/dabtree.c:51:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/repair.c:951:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      fs/xfs/scrub/agheader.c:89:2: warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]
      
      Notice that Clang doesn't recognize /* fall through */ comments as
      implicit fall-through markings, so in order to globally enable
      -Wimplicit-fallthrough for Clang, these comments need to be
      replaced with fallthrough; in the whole codebase.
      
      Link: https://github.com/KSPP/linux/issues/115
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
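For readers unfamiliar with the pseudo-keyword, the following userspace C sketch shows the kind of replacement this commit makes. The macro definition is a simplified version of the idea for GCC and Clang, and `demo_unwind()` is a hypothetical function that appears in none of the files listed above.

```c
#include <assert.h>

/* Simplified definition: use the compiler attribute when it is available,
 * otherwise fall back to an empty statement. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)  /* no annotation available */
#endif

/* Hypothetical example: count how many cleanup stages run when unwinding
 * from `stage` down to stage 1. */
static int demo_unwind(int stage)
{
    int steps = 0;

    assert(stage >= 0 && stage <= 3);  /* stages are 0..3 in this sketch */

    switch (stage) {
    case 3:
        steps++;
        fallthrough;  /* was: a fall-through comment, which Clang ignores */
    case 2:
        steps++;
        fallthrough;
    case 1:
        steps++;
        break;
    default:
        break;
    }
    return steps;
}
```

Because `fallthrough;` is a real statement rather than a comment, the compiler can check it, and any case that falls through without the annotation is reported under -Wimplicit-fallthrough.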
  21. 08 Apr, 2021 4 commits
  22. 04 Feb, 2021 4 commits
  23. 02 Feb, 2021 2 commits
    • xfs: reduce exclusive locking on unaligned dio · ed1128c2
      Dave Chinner committed
      Attempt shared locking for unaligned DIO, but only if the
      underlying extent is already allocated and in written state. On
      failure, retry with the existing exclusive locking.
      
      Test case is fio randrw of 512 byte IOs using AIO and an iodepth of
      32 IOs.
      
      Vanilla:
      
        READ: bw=4560KiB/s (4670kB/s), 4560KiB/s-4560KiB/s (4670kB/s-4670kB/s), io=134MiB (140MB), run=30001-30001msec
        WRITE: bw=4567KiB/s (4676kB/s), 4567KiB/s-4567KiB/s (4676kB/s-4676kB/s), io=134MiB (140MB), run=30001-30001msec
      
      Patched:
         READ: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1127MiB (1182MB), run=30002-30002msec
        WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1128MiB (1183MB), run=30002-30002msec
      
      That's an improvement from ~18k IOPS to ~150k IOPS, which is
      about the IOPS limit of the VM block device setup I'm testing on.
      
      4kB block IO comparison:
      
         READ: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8868MiB (9299MB), run=30002-30002msec
        WRITE: bw=296MiB/s (310MB/s), 296MiB/s-296MiB/s (310MB/s-310MB/s), io=8878MiB (9309MB), run=30002-30002msec
      
      Which is ~150k IOPS, same as what the test gets for sub-block
      AIO+DIO writes with this patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [hch: rebased, split unaligned from nowait]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
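The control flow described in this commit (try the write under a shared lock, and retry under the exclusive lock if that fails) can be sketched in userspace C as follows. This is a sketch under stated assumptions: the `demo_*` names, the extent-state enum and the use of -EAGAIN as the "retry exclusively" signal are illustrative choices, and no real locking is performed.

```c
#include <assert.h>
#include <errno.h>

enum demo_lock_mode { DEMO_SHARED, DEMO_EXCL };
enum demo_extent_state { DEMO_HOLE, DEMO_DELALLOC, DEMO_UNWRITTEN, DEMO_WRITTEN };

struct demo_result {
    enum demo_lock_mode mode;  /* lock mode the write finally ran under */
    int attempts;              /* 1 if the shared attempt was enough */
};

/*
 * Hypothetical single attempt. Under the shared lock the write may only
 * proceed if the extent is already allocated and written; anything else
 * asks the caller to retry with the exclusive lock.
 */
static int demo_try_write(enum demo_extent_state st, enum demo_lock_mode mode)
{
    if (mode == DEMO_SHARED && st != DEMO_WRITTEN)
        return -EAGAIN;
    return 0;
}

/* Optimistic shared attempt first, exclusive fallback on failure. */
static struct demo_result demo_unaligned_dio_write(enum demo_extent_state st)
{
    struct demo_result res = { DEMO_SHARED, 1 };
    int err = demo_try_write(st, DEMO_SHARED);

    if (err == -EAGAIN) {
        res.mode = DEMO_EXCL;
        res.attempts = 2;
        err = demo_try_write(st, DEMO_EXCL);
    }
    assert(err == 0);  /* the exclusive path always succeeds in this sketch */
    (void)err;
    return res;
}
```

An overwrite of a written extent completes in one attempt under the shared lock, which is the case behind the IOPS improvement reported above. Every other extent state costs one extra attempt and ends up on the pre-existing exclusive path.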
    • xfs: split the unaligned DIO write code out · caa89dbc
      Dave Chinner committed
      The unaligned DIO write path is more convoluted than the normal path,
      and we are about to make it more complex. Keep the block aligned
      fast path dio write code trim and simple by splitting out the
      unaligned DIO code from it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      [hch: rebased, fixed a few minor nits]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>