1. 20 10月, 2016 2 次提交
  2. 06 10月, 2016 2 次提交
    • D
      xfs: create a separate cow extent size hint for the allocator · f7ca3522
      Darrick J. Wong 提交于
      Create a per-inode extent size allocator hint for copy-on-write.  This
      hint is separate from the existing extent size hint so that CoW can
      take advantage of the fragmentation-reducing properties of extent size
      hints without disabling delalloc for regular writes.
      
      The extent size hint that's fed to the allocator during a copy on
      write operation is the greater of the cowextsize and regular extsize
      hint.
      
      During reflink, if we're sharing the entire source file to the entire
      destination file and the destination file doesn't already have a
      cowextsize hint, propagate the source file's cowextsize hint to the
      destination file.
      
      Furthermore, zero the bulkstat buffer prior to setting the fields
      so that we don't copy kernel memory contents into userspace.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      f7ca3522
    • D
      xfs: report shared extent mappings to userspace correctly · db1327b1
      Darrick J. Wong 提交于
      Report shared extents through the iomap interface so that FIEMAP flags
      shared blocks accurately.  Have xfs_vm_bmap return zero for reflinked
      files because the bmap-based swap code requires static block mappings,
      which is incompatible with copy on write.
      
      NOTE: Existing userspace bmap users such as lilo will have the same
      problem with reflink files.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      db1327b1
  3. 05 10月, 2016 3 次提交
  4. 19 9月, 2016 6 次提交
  5. 17 8月, 2016 3 次提交
  6. 03 8月, 2016 3 次提交
  7. 21 6月, 2016 2 次提交
  8. 06 4月, 2016 1 次提交
    • C
      xfs: better xfs_trans_alloc interface · 253f4911
      Christoph Hellwig 提交于
      Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
      that returns a transaction with all the required log and block reservations,
      and which allows passing transaction flags directly to avoid the cumbersome
      _xfs_trans_alloc interface.
      
      While we're at it we also get rid of the transaction type argument that has
      been superflous since we stopped supporting the non-CIL logging mode.  The
      guts of it will be removed in another patch.
      
      [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      253f4911
  9. 11 1月, 2016 1 次提交
    • E
      xfs: eliminate committed arg from xfs_bmap_finish · f6106efa
      Eric Sandeen 提交于
      Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
      associated comments were replicated several times across
      the attribute code, all dealing with what to do if the
      transaction was or wasn't committed.
      
      And in that replicated code, an ASSERT() test of an
      uninitialized variable occurs in several locations:
      
      	error = xfs_attr_thing(&args);
      	if (!error) {
      		error = xfs_bmap_finish(&args.trans, args.flist,
      					&committed);
      	}
      	if (error) {
      		ASSERT(committed);
      
      If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
      never set "committed", and then test it in the ASSERT.
      
      Fix this up by moving the committed state internal to xfs_bmap_finish,
      and add a new inode argument.  If an inode is passed in, it is passed
      through to __xfs_trans_roll() and joined to the transaction there if
      the transaction was committed.
      
      xfs_qm_dqalloc() was a little unique in that it called bjoin rather
      than ijoin, but as Dave points out we can detect the committed state
      but checking whether (*tpp != tp).
      
      Addresses-Coverity-Id: 102360
      Addresses-Coverity-Id: 102361
      Addresses-Coverity-Id: 102363
      Addresses-Coverity-Id: 102364
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f6106efa
  10. 04 1月, 2016 1 次提交
    • D
      xfs: Don't use reserved blocks for data blocks with DAX · 3b0fe478
      Dave Chinner 提交于
      Commit 1ca19157 ("xfs: Don't use unwritten extents for DAX") enabled
      the DAX allocation call to dip into the reserve pool in case it was
      converting unwritten extents rather than allocating blocks. This was
      a direct copy of the unwritten extent conversion code, but had an
      unintended side effect of allowing normal data block allocation to
      use the reserve pool. Hence normal block allocation could deplete
      the reserve pool and prevent unwritten extent conversion at ENOSPC,
      hence violating fallocate guarantees on preallocated space.
      
      Fix it by checking whether the incoming map from __xfs_get_blocks()
      spans an unwritten extent and only use the reserve pool if the
      allocation covers an unwritten extent.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3b0fe478
  11. 03 11月, 2015 1 次提交
    • D
      xfs: Don't use unwritten extents for DAX · 1ca19157
      Dave Chinner 提交于
      DAX has a page fault serialisation problem with block allocation.
      Because it allows concurrent page faults and does not have a page
      lock to serialise faults to the same page, it can get two concurrent
      faults to the page that race.
      
      When two read faults race, this isn't a huge problem as the data
      underlying the page is not changing and so "detect and drop" works
      just fine. The issues are to do with write faults.
      
      When two write faults occur, we serialise block allocation in
      get_blocks() so only one faul will allocate the extent. It will,
      however, be marked as an unwritten extent, and that is where the
      problem lies - the DAX fault code cannot differentiate between a
      block that was just allocated and a block that was preallocated and
      needs zeroing. The result is that both write faults end up zeroing
      the block and attempting to convert it back to written.
      
      The problem is that the first fault can zero and convert before the
      second fault starts zeroing, resulting in the zeroing for the second
      fault overwriting the data that the first fault wrote with zeros.
      The second fault then attempts to convert the unwritten extent,
      which is then a no-op because it's already written. Data loss occurs
      as a result of this race.
      
      Because there is no sane locking construct in the page fault code
      that we can use for serialisation across the page faults, we need to
      ensure block allocation and zeroing occurs atomically in the
      filesystem. This means we can still take concurrent page faults and
      the only time they will serialise is in the filesystem
      mapping/allocation callback. The page fault code will always see
      written, initialised extents, so we will be able to remove the
      unwritten extent handling from the DAX code when all filesystems are
      converted.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      1ca19157
  12. 12 10月, 2015 3 次提交
    • B
      xfs: per-filesystem stats counter implementation · ff6d6af2
      Bill O'Donnell 提交于
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ff6d6af2
    • B
      xfs: pass total block res. as total xfs_bmapi_write() parameter · dbd5c8c9
      Brian Foster 提交于
      The total field from struct xfs_alloc_arg is a bit of an unknown
      commodity. It is documented as the total block requirement for the
      transaction and is used in this manner from most call sites by virtue of
      passing the total block reservation of the transaction associated with
      an allocation. Several xfs_bmapi_write() callers pass hardcoded values
      of 0 or 1 for the total block requirement, which is a historical oddity
      without any clear reasoning.
      
      The xfs_iomap_write_direct() caller, for example, passes 0 for the total
      block requirement. This has been determined to cause problems in the
      form of ABBA deadlocks of AGF buffers due to incorrect AG selection in
      the block allocator. Specifically, the xfs_alloc_space_available()
      function incorrectly selects an AG that doesn't actually have sufficient
      space for the allocation. This occurs because the args.total field is 0
      and thus the remaining free space check on the AG doesn't actually
      consider the size of the allocation request. This locks the AGF buffer,
      the allocation attempt proceeds and ultimately fails (in
      xfs_alloc_fix_minleft()), and xfs_alloc_vexent() moves on to the next
      AG. In turn, this can lead to incorrect AG locking order (if the
      allocator wraps around, attempting to lock AG 0 after acquiring AG N)
      and thus deadlock if racing with another operation. This problem has
      been reproduced via generic/299 on smallish (1GB) ramdisk test devices.
      
      To avoid this problem, replace the undocumented hardcoded total
      parameters from the iomap and utility callers to pass the block
      reservation used for the associated transaction. This is consistent with
      other xfs_bmapi_write() callers throughout XFS. The assumption is that
      the total field allows the selection of an AG that can handle the entire
      operation rather than simply the allocation/range being requested (e.g.,
      resulting btree splits, etc.). This addresses the aforementioned
      generic/299 hang by ensuring AG selection only occurs when the
      allocation can be satisfied by the AG.
      Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      dbd5c8c9
    • B
      xfs: add missing ilock around dio write last extent alignment · 009c6e87
      Brian Foster 提交于
      The iomap codepath (via get_blocks()) acquires and release the inode
      lock in the case of a direct write that requires block allocation. This
      is because xfs_iomap_write_direct() allocates a transaction, which means
      the ilock must be dropped and reacquired after the transaction is
      allocated and reserved.
      
      xfs_iomap_write_direct() invokes xfs_iomap_eof_align_last_fsb() before
      the transaction is created and thus before the ilock is reacquired. This
      can lead to calls to xfs_iread_extents() and reads of the in-core extent
      list without any synchronization (via xfs_bmap_eof() and
      xfs_bmap_last_extent()). xfs_iread_extents() assert fails if the ilock
      is not held, but this is not currently seen in practice as the current
      callers had already invoked xfs_bmapi_read().
      
      What has been seen in practice are reports of crashes down in the
      xfs_bmap_eof() codepath on direct writes due to seemingly bogus pointer
      references from xfs_iext_get_ext(). While an explicit reproducer is not
      currently available to confirm the cause of the problem, crash analysis
      and code inspection from David Jeffrey had identified the insufficient
      locking.
      
      xfs_iomap_eof_align_last_fsb() is called from other contexts with the
      inode lock already held, so we cannot acquire it therein.
      __xfs_get_blocks() acquires and drops the ilock with variable flags to
      cover the event that the extent list must be read in. The common case is
      that __xfs_get_blocks() acquires the shared ilock. To provide locking
      around the last extent alignment call without adding more lock cycles to
      the dio path, update xfs_iomap_write_direct() to expect the shared ilock
      held on entry and do the extent alignment under its protection. Demote
      the lock, if necessary, from __xfs_get_blocks() and push the
      xfs_qm_dqattach() call outside of the shared lock critical section.
      Also, add an assert to document that the extent list is always expected
      to be present in this path. Otherwise, we risk a call to
      xfs_iread_extents() while under the shared ilock. This is safe as all
      current callers have executed an xfs_bmapi_read() call under the current
      iolock context.
      Reported-by: NDavid Jeffery <djeffery@redhat.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      009c6e87
  13. 04 6月, 2015 2 次提交
    • C
      xfs: saner xfs_trans_commit interface · 70393313
      Christoph Hellwig 提交于
      The flags argument to xfs_trans_commit is not useful for most callers, as
      a commit of a transaction without a permanent log reservation must pass
      0 here, and all callers for a transaction with a permanent log reservation
      except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES.  So remove
      the flags argument from the public xfs_trans_commit interfaces, and
      introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
      that regrants a log reservation instead of releasing it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      70393313
    • C
      xfs: remove the flags argument to xfs_trans_cancel · 4906e215
      Christoph Hellwig 提交于
      xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
      XFS_TRANS_ABORT.  Both of them are a direct product of the transaction
      state, and can be deducted:
      
       - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
         and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
       - any transaction with a permanent log reservation needs
         XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
         XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
         log reservation is invalid.
      
      So just remove the flags argument and do the right thing.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      4906e215
  14. 23 2月, 2015 2 次提交
    • D
      xfs: Remove icsb infrastructure · 5681ca40
      Dave Chinner 提交于
      Now that the in-core superblock infrastructure has been replaced with
      generic per-cpu counters, we don't need it anymore. Nuke it from
      orbit so we are sure that it won't haunt us again...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5681ca40
    • D
      xfs: use generic percpu counters for free block counter · 0d485ada
      Dave Chinner 提交于
      XFS has hand-rolled per-cpu counters for the superblock since before
      there was any generic implementation. The free block counter is
      special in that it is used for ENOSPC detection outside transaction
      contexts for for delayed allocation. This means that the counter
      needs to be accurate at zero. The current per-cpu counter code jumps
      through lots of hoops to ensure we never run past zero, but we don't
      need to make all those jumps with the generic counter
      implementation.
      
      The generic counter implementation allows us to pass a "batch"
      threshold at which the addition/subtraction to the counter value
      will be folded back into global value under lock. We can use this
      feature to reduce the batch size as we approach 0 in a very similar
      manner to the existing counters and their rebalance algorithm. If we
      use a batch size of 1 as we approach 0, then every addition and
      subtraction will be done against the global value and hence allow
      accurate detection of zero threshold crossing.
      
      Hence we can replace the handrolled, accurate-at-zero counters with
      generic percpu counters.
      
      Note: this removes just enough of the icsb infrastructure to compile
      without warnings. The rest will go in subsequent commits.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      0d485ada
  15. 09 1月, 2015 1 次提交
  16. 04 12月, 2014 2 次提交
  17. 28 11月, 2014 3 次提交
  18. 02 10月, 2014 1 次提交
  19. 24 7月, 2014 1 次提交