  1. 04 Apr 2017, 1 commit
  2. 09 Mar 2017, 1 commit
    • xfs: use iomap new flag for newly allocated delalloc blocks · f65e6fad
      Brian Foster authored
      Commit fa7f138a ("xfs: clear delalloc and cache on buffered write
      failure") fixed one regression in the iomap error handling code and
      exposed another. The fundamental problem is that if a buffered write
      is a rewrite of preexisting delalloc blocks and the write fails, the
      failure handling code can punch out preexisting blocks with valid
      file data.
      
      This was reproduced directly by sub-block writes in the LTP
      kernel/syscalls/write/write03 test. A first 100 byte write allocates
      a single block in a file. A subsequent 100 byte write fails and
      punches out the block, including the data successfully written by
      the previous write.
      
      To address this problem, update the ->iomap_begin() handler to
      distinguish newly allocated delalloc blocks from preexisting
      delalloc blocks via the IOMAP_F_NEW flag. Use this flag in the
      ->iomap_end() handler to decide when a failed or short write should
      punch out delalloc blocks.
      
      This introduces the subtle requirement that ->iomap_begin() should
      never combine newly allocated delalloc blocks with existing blocks
      in the resulting iomap descriptor. This can occur when a new
      delalloc reservation merges with a neighboring extent that is part
      of the current write, for example. Therefore, drop the
      post-allocation extent lookup from xfs_bmapi_reserve_delalloc() and
      just return the record inserted into the fork. This ensures only new
      blocks are returned and thus that preexisting delalloc blocks are
      always handled as "found" blocks and not punched out on a failed
      rewrite.
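
      The decision can be summed up in a small stand-alone C sketch; the
      struct and flag values below are simplified stand-ins for the kernel's
      iomap types, not the real definitions:

        #include <stdbool.h>

        #define TOY_IOMAP_DELALLOC 0x01        /* illustrative values only */
        #define TOY_IOMAP_F_NEW    0x02

        struct toy_iomap {
                unsigned int type;      /* kind of mapping ->iomap_begin() returned */
                unsigned int flags;     /* TOY_IOMAP_F_NEW only for fresh delalloc */
        };

        /* ->iomap_end()-style check: punch only blocks this write reserved. */
        bool toy_should_punch_delalloc(const struct toy_iomap *iomap,
                                       long long written, long long length)
        {
                if (iomap->type != TOY_IOMAP_DELALLOC)
                        return false;
                if (!(iomap->flags & TOY_IOMAP_F_NEW))
                        return false;           /* preexisting blocks hold valid data */
                return written < length;        /* failed or short write */
        }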
      Reported-by: Xiong Zhou <xzhou@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  3. 17 Feb 2017, 2 commits
    • xfs: resurrect debug mode drop buffered writes mechanism · 9dbddd7b
      Brian Foster authored
      A debug mode write failure mechanism was introduced to XFS in commit
      801cc4e1 ("xfs: debug mode forced buffered write failure") to
      facilitate targeted testing of delalloc indirect reservation management
      from userspace. This code was subsequently rendered ineffective by the
      move to iomap based buffered writes in commit 68a9f5e7 ("xfs:
      implement iomap based buffered write path"). This likely went unnoticed
      because the associated userspace code had not made it into xfstests.
      
      Resurrect this mechanism to facilitate effective indlen reservation
      testing from xfstests. The move to iomap based buffered writes relocated
      the hook this mechanism needs to return write failure from XFS to
      generic code. The failure trigger must remain in XFS. Given that
      limitation, convert this from a write failure mechanism to one that
      simply drops writes without returning failure to userspace. Rename all
      "fail_writes" references to "drop_writes" to illustrate the point. This
      is more hacky than preferred, but still triggers the XFS error handling
      behavior required to drive the indlen tests. This is only available in
      DEBUG mode and for testing purposes only.
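
      As a rough sketch of the idea (names and types below are invented for
      illustration, not the kernel code), the knob simply makes the XFS end
      handler behave as if nothing was written, which exercises the delalloc
      error handling without surfacing an error to userspace:

        #include <stdbool.h>

        struct toy_mount {
                bool drop_writes;       /* per-mount, DEBUG-only knob */
        };

        /* End-of-write hook: cleans up delalloc, never returns an error. */
        int toy_iomap_end_delalloc(struct toy_mount *mp, long long written,
                                   long long length)
        {
                if (mp->drop_writes)
                        written = 0;    /* treat the data as if it never landed */
                if (written < length) {
                        /* ... punch out the delalloc blocks backing the
                         * unwritten tail, as a real short write would ... */
                }
                return 0;               /* userspace still sees the write succeed */
        }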
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: clear delalloc and cache on buffered write failure · fa7f138a
      Brian Foster authored
      The buffered write failure handling code in
      xfs_file_iomap_end_delalloc() has a couple minor problems. First, if
      written == 0, start_fsb is not rounded down and it fails to kill off a
      delalloc block if the start offset is block unaligned. This results in a
      lingering delalloc block and broken delalloc block accounting detected
      at unmount time. Fix this by rounding down start_fsb in the unlikely
      event that written == 0.
      
      Second, it is possible for a failed overwrite of a delalloc extent to
      leave dirty pagecache around over a hole in the file. This is because
      it is possible to hit ->iomap_end() on write failure before the iomap
      code has attempted to allocate pagecache, so it has nothing to clean
      up. If the targeted delalloc extent was successfully written by a
      previous write, however, then it does still have dirty pages when
      ->iomap_end() punches out the underlying blocks. This ultimately
      results in writeback over a hole. To fix this problem, unconditionally
      punch out the pagecache from XFS before punching out the associated
      delalloc range.
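
      The start_fsb fix is plain block arithmetic; a tiny stand-alone sketch
      (block size and offsets below are made up for illustration):

        #include <assert.h>
        #include <stdint.h>

        /* Round a byte offset down to the first byte of its filesystem block. */
        static uint64_t round_down_to_block(uint64_t offset, uint32_t blocksize)
        {
                return offset - (offset % blocksize);
        }

        int main(void)
        {
                /*
                 * A failed sub-block write at byte 150 with written == 0 must
                 * start cleanup at block 0, not at byte 150, or the delalloc
                 * block covering bytes 0-4095 survives and breaks accounting.
                 */
                assert(round_down_to_block(150, 4096) == 0);
                assert(round_down_to_block(8292, 4096) == 8192);
                return 0;
        }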
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  4. 07 Feb 2017, 3 commits
  5. 03 Feb 2017, 1 commit
    • xfs: mark speculative prealloc CoW fork extents unwritten · 5eda4300
      Darrick J. Wong authored
      Christoph Hellwig pointed out that there's a potentially nasty race when
      performing simultaneous nearby directio cow writes:
      
      "Thread 1 writes a range from B to c
      
      "                    B --------- C
                                 p
      
      "a little later thread 2 writes from A to B
      
      "        A --------- B
                     p
      
      [editor's note: the 'p' markers denote cowextsize boundaries, which I
      added to make this clearer]
      
      "but the code preallocates beyond B into the range where thread
      "1 has just written, but ->end_io hasn't been called yet.
      "But once ->end_io is called thread 2 has already allocated
      "up to the extent size hint into the write range of thread 1,
      "so the end_io handler will splice the unintialized blocks from
      "that preallocation back into the file right after B."
      
      We can avoid this race by ensuring that thread 1's end_io handler can
      never accidentally remap the blocks that thread 2 allocated as part of
      its speculative preallocation during t2's write preparation.
      The way we make this happen is by taking advantage of the unwritten
      extent flag as an intermediate step.
      
      Recall that when we begin the process of writing data to shared blocks,
      we create a delayed allocation extent in the CoW fork:
      
      D: --RRRRRRSSSRRRRRRRR---
      C: ------DDDDDDD---------
      
      When a thread prepares to CoW some dirty data out to disk, it will now
      convert the delalloc reservation into an /unwritten/ allocated extent in
      the cow fork.  The da conversion code tries to opportunistically
      allocate as much of a (speculatively prealloc'd) extent as possible, so
      we may end up allocating a larger extent than we're actually writing
      out:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UUUUUUU---------
      
      Next, we convert only the part of the extent that we're actively
      planning to write to normal (i.e. not unwritten) status:
      
      D: --RRRRRRSSSRRRRRRRR---
      U: ------UURRUUU---------
      
      If the write succeeds, the end_cow function will now scan the relevant
      range of the CoW fork for real extents and remap only the real extents
      into the data fork:
      
      D: --RRRRRRRRSRRRRRRRR---
      U: ------UU--UUU---------
      
      This ensures that we never obliterate valid data fork extents with
      unwritten blocks from the CoW fork.
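
      The remap step can be modelled in a few lines of stand-alone C over the
      diagrams above (one character per block; this is only an illustration,
      not the actual end_cow code):

        #include <stdio.h>
        #include <string.h>

        int main(void)
        {
                /* One char per block, using the same letters as the diagrams. */
                char data[] = "--RRRRRRSSSRRRRRRRR---";
                char cow[]  = "------UURRUUU---------";  /* CoW fork after the write */

                for (size_t i = 0; i < strlen(cow); i++) {
                        if (cow[i] == 'R') {    /* only real extents were written */
                                data[i] = 'R';  /* remap into the data fork */
                                cow[i] = '-';
                        }
                        /* 'U' blocks stay put and never clobber valid data. */
                }
                printf("D: %s\nU: %s\n", data, cow);    /* matches the final diagram */
                return 0;
        }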
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  6. 31 Jan 2017, 1 commit
  7. 24 Jan 2017, 1 commit
    • xfs: fix COW writeback race · d2b3964a
      Christoph Hellwig authored
      Because xfs_iomap_write_allocate tries to convert the whole found
      extent from delalloc to real space, we can run into a race condition
      when multiple threads write to the same extent. For the non-COW case
      that is harmless, as the only thing that can happen is that we call
      xfs_bmapi_write on an extent that has already been converted to a real
      allocation.  For COW writes, where we move the extent from the COW
      fork to the data fork after I/O completion, the race is not quite as
      harmless.  In the worst case we are now calling xfs_bmapi_write on a
      region that contains a hole in the COW fork, which will trip an assert
      in debug builds or lead to file system corruption in non-debug builds.
      This seems to be reproducible with workloads of small O_DSYNC writes,
      although so far I've not managed to come up with an isolated
      reproducer.
      
      The fix for the issue is relatively simple:  tell xfs_bmapi_write
      that we are only asked to convert delayed allocations and skip holes
      in that case.
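
      A stand-alone sketch of that rule (the enum and names are invented for
      illustration; the real change passes a new flag into xfs_bmapi_write()):

        enum toy_rec { TOY_HOLE, TOY_DELALLOC, TOY_REAL };

        /*
         * Convert a range of per-block records.  With delalloc_only set we
         * never allocate over a hole: if another thread already converted
         * and remapped part of the range, those blocks are simply skipped.
         */
        void toy_convert_range(enum toy_rec *recs, int nrecs, int delalloc_only)
        {
                for (int i = 0; i < nrecs; i++) {
                        if (recs[i] == TOY_DELALLOC)
                                recs[i] = TOY_REAL;     /* delalloc -> real space */
                        else if (recs[i] == TOY_HOLE && !delalloc_only)
                                recs[i] = TOY_REAL;     /* fresh allocation */
                }
        }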
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  8. 30 Nov 2016, 1 commit
    • xfs: use iomap_dio_rw · acdda3aa
      Christoph Hellwig authored
      Straight switch over to using iomap for direct I/O - we already have the
      non-COW dio path in write_begin for DAX and files with extent size hints,
      so nothing to add there.  The COW path is ported over from the old
      get_blocks version and is a bit of a mess, but I have some work in progress
      to make it look more like the buffered I/O COW path.
      
      This gets rid of xfs_get_blocks_direct and the last caller of
      xfs_get_blocks with the create flag set, so all that code can be removed.
      
      Last but not least I've removed a comment in xfs_filemap_fault that
      refers to xfs_get_blocks entirely instead of updating it - while the
      reference is correct, the whole DAX fault path looks different than
      the non-DAX one, so it seems rather pointless.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Jens Axboe <axboe@fb.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  9. 28 Nov 2016, 2 commits
    • xfs: pass post-eof speculative prealloc blocks to bmapi · f782088c
      Brian Foster authored
      xfs_file_iomap_begin_delay() implements post-eof speculative
      preallocation by extending the block count of the requested delayed
      allocation. Now that xfs_bmapi_reserve_delalloc() has been updated to
      handle prealloc blocks separately and tag the inode, update
      xfs_file_iomap_begin_delay() to use the new parameter and rely on the
      former to tag the inode.
      
      Note that this patch does not change behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
    • xfs: track preallocation separately in xfs_bmapi_reserve_delalloc() · 974ae922
      Brian Foster authored
      Speculative preallocation is currently processed entirely by the callers
      of xfs_bmapi_reserve_delalloc(). The caller determines how much
      preallocation to include, adjusts the extent length and passes down the
      resulting request.
      
      While this works fine for post-eof speculative preallocation, it is not
      as reliable for COW fork preallocation. COW fork preallocation is
      implemented via the cowextszhint, which aligns the start offset as well
      as the length of the extent. Further, it is difficult for the caller to
      accurately identify when preallocation occurs because the returned
      extent could have been merged with neighboring extents in the fork.
      
      To simplify this situation and facilitate further COW fork preallocation
      enhancements, update xfs_bmapi_reserve_delalloc() to take a separate
      preallocation parameter to incorporate into the allocation request. The
      preallocation blocks value is tacked onto the end of the request and
      adjusted to accommodate neighboring extents and extent size limits.
      Since xfs_bmapi_reserve_delalloc() now knows precisely how much
      preallocation was included in the allocation, it can also tag the inodes
      appropriately to support preallocation reclaim.
      
      Note that xfs_bmapi_reserve_delalloc() callers are not yet updated to
      use the preallocation mechanism. This patch should not change behavior
      outside of correctly tagging reflink inodes when start offset
      preallocation occurs (which the caller does not handle correctly).
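
      A minimal sketch of the shape of the new interface (types and names
      here are illustrative, not the actual xfs_bmapi_reserve_delalloc()
      signature):

        struct toy_inode {
                int prealloc_tagged;    /* lets the reclaim code find this inode */
        };

        /*
         * The caller asks for "len" blocks of data plus "prealloc" blocks of
         * speculation; the callee, which knows the surrounding extents, trims
         * the combined request and tags the inode only if prealloc survived.
         */
        long long toy_reserve_delalloc(struct toy_inode *ip, long long off,
                                       long long len, long long prealloc)
        {
                long long alen = len + prealloc;        /* tacked onto the end */

                /* ... trim alen against neighbouring extents and size limits ... */
                if (alen > len)
                        ip->prealloc_tagged = 1;
                return alen;    /* length actually reserved, starting at off */
        }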
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
  10. 24 Nov 2016, 2 commits
  11. 20 Oct 2016, 2 commits
  12. 06 Oct 2016, 2 commits
    • xfs: create a separate cow extent size hint for the allocator · f7ca3522
      Darrick J. Wong authored
      Create a per-inode extent size allocator hint for copy-on-write.  This
      hint is separate from the existing extent size hint so that CoW can
      take advantage of the fragmentation-reducing properties of extent size
      hints without disabling delalloc for regular writes.
      
      The extent size hint that's fed to the allocator during a copy on
      write operation is the greater of the cowextsize and regular extsize
      hint.
      
      During reflink, if we're sharing the entire source file to the entire
      destination file and the destination file doesn't already have a
      cowextsize hint, propagate the source file's cowextsize hint to the
      destination file.
      
      Furthermore, zero the bulkstat buffer prior to setting the fields
      so that we don't copy kernel memory contents into userspace.
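
      The hint selection itself reduces to a one-liner; a sketch with
      invented names:

        /* CoW allocations use the larger of the two per-inode hints. */
        unsigned int toy_cow_alloc_hint(unsigned int extsize_hint,
                                        unsigned int cowextsize_hint)
        {
                return cowextsize_hint > extsize_hint ? cowextsize_hint
                                                      : extsize_hint;
        }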
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
    • xfs: report shared extent mappings to userspace correctly · db1327b1
      Darrick J. Wong authored
      Report shared extents through the iomap interface so that FIEMAP flags
      shared blocks accurately.  Have xfs_vm_bmap return zero for reflinked
      files because the bmap-based swap code requires static block mappings,
      which is incompatible with copy on write.
      
      NOTE: Existing userspace bmap users such as lilo will have the same
      problem with reflink files.
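
      A stand-alone sketch of the ->bmap behaviour (simplified types;
      toy_block_lookup is a made-up stand-in for the normal mapping lookup):

        /* Stand-in for the ordinary logical-to-physical block lookup. */
        static long long toy_block_lookup(long long logical_block)
        {
                return logical_block + 1000;
        }

        /*
         * Legacy bmap interface: a reflinked file has no static mapping to
         * promise, so report "no block" instead of an address that a later
         * copy-on-write may silently move.
         */
        long long toy_vm_bmap(int is_reflink_inode, long long logical_block)
        {
                if (is_reflink_inode)
                        return 0;
                return toy_block_lookup(logical_block);
        }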
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  13. 05 Oct 2016, 3 commits
  14. 19 Sep 2016, 6 commits
  15. 17 Aug 2016, 3 commits
  16. 03 Aug 2016, 3 commits
  17. 21 Jun 2016, 2 commits
  18. 06 Apr 2016, 1 commit
    • xfs: better xfs_trans_alloc interface · 253f4911
      Christoph Hellwig authored
      Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
      that returns a transaction with all the required log and block reservations,
      and which allows passing transaction flags directly to avoid the cumbersome
      _xfs_trans_alloc interface.
      
      While we're at it we also get rid of the transaction type argument that has
      been superfluous since we stopped supporting the non-CIL logging mode.  The
      guts of it will be removed in another patch.
      
      [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
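
      A toy model of the resulting interface shape (names are illustrative,
      not the kernel API): one call either hands back a fully reserved
      transaction or fails cleanly, instead of an allocate step and a reserve
      step that every caller must sequence and unwind:

        #include <stdlib.h>

        struct toy_trans {
                unsigned int blocks;    /* block reservation */
                unsigned int flags;     /* passed straight in at creation */
        };

        int toy_trans_alloc(unsigned int blocks, unsigned int flags,
                            struct toy_trans **tpp)
        {
                struct toy_trans *tp = malloc(sizeof(*tp));

                if (!tp)
                        return -1;
                /* reservation happens here, before the caller ever sees tp */
                tp->blocks = blocks;
                tp->flags = flags;
                *tpp = tp;
                return 0;
        }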
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  19. 11 Jan 2016, 1 commit
    • xfs: eliminate committed arg from xfs_bmap_finish · f6106efa
      Eric Sandeen authored
      Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
      associated comments were replicated several times across
      the attribute code, all dealing with what to do if the
      transaction was or wasn't committed.
      
      And in that replicated code, an ASSERT() test of an
      uninitialized variable occurs in several locations:
      
      	error = xfs_attr_thing(&args);
      	if (!error) {
      		error = xfs_bmap_finish(&args.trans, args.flist,
      					&committed);
      	}
      	if (error) {
      		ASSERT(committed);
      
      If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
      never set "committed", and then test it in the ASSERT.
      
      Fix this up by moving the committed state internal to xfs_bmap_finish,
      and add a new inode argument.  If an inode is passed in, it is passed
      through to __xfs_trans_roll() and joined to the transaction there if
      the transaction was committed.
      
      xfs_qm_dqalloc() was a little unique in that it called bjoin rather
      than ijoin, but as Dave points out we can detect the committed state
      by checking whether (*tpp != tp).
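
      The pointer check can be illustrated with a two-line toy (not the XFS
      code): rolling a transaction commits the old one and hands back a fresh
      handle, so a changed pointer is proof that a commit happened.

        struct toy_trans;       /* opaque transaction handle */

        int toy_was_committed(const struct toy_trans *saved_tp,
                              const struct toy_trans *current_tp)
        {
                return saved_tp != current_tp;
        }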
      
      Addresses-Coverity-Id: 102360
      Addresses-Coverity-Id: 102361
      Addresses-Coverity-Id: 102363
      Addresses-Coverity-Id: 102364
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  20. 04 Jan 2016, 1 commit
    • xfs: Don't use reserved blocks for data blocks with DAX · 3b0fe478
      Dave Chinner authored
      Commit 1ca19157 ("xfs: Don't use unwritten extents for DAX") enabled
      the DAX allocation call to dip into the reserve pool in case it was
      converting unwritten extents rather than allocating blocks. This was
      a direct copy of the unwritten extent conversion code, but had an
      unintended side effect of allowing normal data block allocation to
      use the reserve pool. Hence normal block allocation could deplete
      the reserve pool and prevent unwritten extent conversion at ENOSPC,
      thereby violating fallocate guarantees on preallocated space.
      
      Fix it by checking whether the incoming map from __xfs_get_blocks()
      spans an unwritten extent and only use the reserve pool if the
      allocation covers an unwritten extent.
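
      The check boils down to a single predicate; a sketch with an invented
      enum (the real code inspects the imap returned by __xfs_get_blocks()):

        enum toy_extent_state { TOY_EXT_HOLE, TOY_EXT_WRITTEN, TOY_EXT_UNWRITTEN };

        /*
         * Only unwritten extent conversion may dip into the reserve pool;
         * ordinary data block allocation must fail at ENOSPC instead of
         * draining blocks that fallocate has already promised to others.
         */
        int toy_may_use_reserve_pool(enum toy_extent_state incoming_map)
        {
                return incoming_map == TOY_EXT_UNWRITTEN;
        }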
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  21. 03 Nov 2015, 1 commit
    • xfs: Don't use unwritten extents for DAX · 1ca19157
      Dave Chinner authored
      DAX has a page fault serialisation problem with block allocation.
      Because it allows concurrent page faults and does not have a page
      lock to serialise faults to the same page, it can get two concurrent
      faults to the page that race.
      
      When two read faults race, this isn't a huge problem as the data
      underlying the page is not changing and so "detect and drop" works
      just fine. The issues are to do with write faults.
      
      When two write faults occur, we serialise block allocation in
      get_blocks() so only one fault will allocate the extent. It will,
      however, be marked as an unwritten extent, and that is where the
      problem lies - the DAX fault code cannot differentiate between a
      block that was just allocated and a block that was preallocated and
      needs zeroing. The result is that both write faults end up zeroing
      the block and attempting to convert it back to written.
      
      The problem is that the first fault can zero and convert before the
      second fault starts zeroing, resulting in the zeroing for the second
      fault overwriting the data that the first fault wrote with zeros.
      The second fault then attempts to convert the unwritten extent,
      which is then a no-op because it's already written. Data loss occurs
      as a result of this race.
      
      Because there is no sane locking construct in the page fault code
      that we can use for serialisation across the page faults, we need to
      ensure block allocation and zeroing occurs atomically in the
      filesystem. This means we can still take concurrent page faults and
      the only time they will serialise is in the filesystem
      mapping/allocation callback. The page fault code will always see
      written, initialised extents, so we will be able to remove the
      unwritten extent handling from the DAX code when all filesystems are
      converted.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      