1. 08 11月, 2016 1 次提交
  2. 20 10月, 2016 5 次提交
  3. 06 10月, 2016 4 次提交
    • D
      xfs: try other AGs to allocate a BMBT block · 90e2056d
      Darrick J. Wong 提交于
      Prior to the introduction of reflink, allocating a block and mapping
      it into a file was performed in a single transaction with a single
      block reservation, and the allocator was supposed to find enough
      blocks to allocate the extent and any BMBT blocks that might be
      necessary (unless we're low on space).
      
      However, due to the way copy on write works, allocation and mapping
      have been split into two transactions, which means that we must be
      able to handle the case where we allocate an extent for CoW but that
      AG runs out of free space before the blocks can be mapped into a file,
      and the mapping requires a new BMBT block.  When this happens, look in
      one of the other AGs for a BMBT block instead of taking the FS down.
      
      The same applies to the functions that convert a data fork to extents
      and later btree format.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      90e2056d
    • D
      xfs: create a separate cow extent size hint for the allocator · f7ca3522
      Darrick J. Wong 提交于
      Create a per-inode extent size allocator hint for copy-on-write.  This
      hint is separate from the existing extent size hint so that CoW can
      take advantage of the fragmentation-reducing properties of extent size
      hints without disabling delalloc for regular writes.
      
      The extent size hint that's fed to the allocator during a copy on
      write operation is the greater of the cowextsize and regular extsize
      hint.
      
      During reflink, if we're sharing the entire source file to the entire
      destination file and the destination file doesn't already have a
      cowextsize hint, propagate the source file's cowextsize hint to the
      destination file.
      
      Furthermore, zero the bulkstat buffer prior to setting the fields
      so that we don't copy kernel memory contents into userspace.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      f7ca3522
    • D
      xfs: store in-progress CoW allocations in the refcount btree · 174edb0e
      Darrick J. Wong 提交于
      Due to the way the CoW algorithm in XFS works, there's an interval
      during which blocks allocated to handle a CoW can be lost -- if the FS
      goes down after the blocks are allocated but before the block
      remapping takes place.  This is exacerbated by the cowextsz hint --
      allocated reservations can sit around for a while, waiting to get
      used.
      
      Since the refcount btree doesn't normally store records with refcount
      of 1, we can use it to record these in-progress extents.  In-progress
      blocks cannot be shared because they're not user-visible, so there
      shouldn't be any conflicts with other programs.  This is a better
      solution than holding EFIs during writeback because (a) EFIs can't be
      relogged currently, (b) even if they could, EFIs are bound by
      available log space, which puts an unnecessary upper bound on how much
      CoW we can have in flight, and (c) we already have a mechanism to
      track blocks.
      
      At mount time, read the refcount records and free anything we find
      with a refcount of 1 because those were in-progress when the FS went
      down.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      174edb0e
    • D
      xfs: support removing extents from CoW fork · 4862cfe8
      Darrick J. Wong 提交于
      Create a helper method to remove extents from the CoW fork without
      any of the side effects (rmapbt/bmbt updates) of the regular extent
      deletion routine.  We'll eventually use this to clear out the CoW fork
      during ioend processing.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      4862cfe8
  4. 05 10月, 2016 7 次提交
  5. 04 10月, 2016 1 次提交
  6. 26 9月, 2016 1 次提交
    • D
      xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner 提交于
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that allocation is
      metadata and reuse of busy extents is acceptible due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation is that has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, This patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      292378ed
  7. 19 9月, 2016 2 次提交
    • C
      xfs: rewrite and optimize the delalloc write path · 51446f5b
      Christoph Hellwig 提交于
      Currently xfs_iomap_write_delay does up to lookups in the inode
      extent tree, which is rather costly especially with the new iomap
      based write path and small write sizes.
      
      But it turns out that the low-level xfs_bmap_search_extents gives us
      all the information we need in the regular delalloc buffered write
      path:
      
       - it will return us an extent covering the block we are looking up
         if it exists.  In that case we can simply return that extent to
         the caller and are done
       - it will tell us if we are beyoned the last current allocated
         block with an eof return parameter.  In that case we can create a
         delalloc reservation and use the also returned information about
         the last extent in the file as the hint to size our delalloc
         reservation.
       - it can tell us that we are writing into a hole, but that there is
         an extent beyoned this hole.  In this case we can create a
         delalloc reservation that covers the requested size (possible
         capped to the next existing allocation).
      
      All that can be done in one single routine instead of bouncing up
      and down a few layers.  This reduced the CPU overhead of the block
      mapping routines and also simplified the code a lot.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      51446f5b
    • D
      xfs: set up per-AG free space reservations · 3fd129b6
      Darrick J. Wong 提交于
      One unfortunate quirk of the reference count and reverse mapping
      btrees -- they can expand in size when blocks are written to *other*
      allocation groups if, say, one large extent becomes a lot of tiny
      extents.  Since we don't want to start throwing errors in the middle
      of CoWing, we need to reserve some blocks to handle future expansion.
      The transaction block reservation counters aren't sufficient here
      because we have to have a reserve of blocks in every AG, not just
      somewhere in the filesystem.
      
      Therefore, create two per-AG block reservation pools.  One feeds the
      AGFL so that rmapbt expansion always succeeds, and the other feeds all
      other metadata so that refcountbt expansion never fails.
      
      Use the count of how many reserved blocks we need to have on hand to
      create a virtual reservation in the AG.  Through selective clamping of
      the maximum length of allocation requests and of the length of the
      longest free extent, we can make it look like there's less free space
      in the AG unless the reservation owner is asking for blocks.
      
      In other words, play some accounting tricks in-core to make sure that
      we always have blocks available.  On the plus side, there's nothing to
      clean up if we crash, which is contrast to the strategy that the rough
      draft used (actually removing extents from the freespace btrees).
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3fd129b6
  8. 03 8月, 2016 8 次提交
  9. 21 6月, 2016 2 次提交
  10. 06 4月, 2016 1 次提交
    • C
      xfs: better xfs_trans_alloc interface · 253f4911
      Christoph Hellwig 提交于
      Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
      that returns a transaction with all the required log and block reservations,
      and which allows passing transaction flags directly to avoid the cumbersome
      _xfs_trans_alloc interface.
      
      While we're at it we also get rid of the transaction type argument that has
      been superflous since we stopped supporting the non-CIL logging mode.  The
      guts of it will be removed in another patch.
      
      [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      253f4911
  11. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  12. 15 3月, 2016 3 次提交
    • B
      xfs: borrow indirect blocks from freed extent when available · d34999c9
      Brian Foster 提交于
      xfs_bmap_del_extent() handles extent removal from the in-core and
      on-disk extent lists. When removing a delalloc range, it updates the
      indirect block reservation appropriately based on the removal. It
      currently enforces that the new indirect block reservation is less than
      or equal to the original. This is normally the case in all situations
      except for in certain cases when the removed range creates a hole in a
      single delalloc extent, thus splitting a single delalloc extent in two.
      
      It is possible with small enough extents to split an indlen==1 extent
      into two such slightly smaller extents. This leaves one extent with 0
      indirect blocks and leads to assert failures in other areas (e.g.,
      xfs_bunmapi() if the extent happens to be removed).
      
      Update the indlen distribution code to steal blocks from the deleted
      extent, if necessary, to satisfy the worst case total indirect
      reservation for the new extents. This is safe as the caller does not
      update the fdblocks counters until the extent is removed. Blocks stolen
      in this manner simply remain accounted as allocated, having ownership
      transferred from the data extent to an indirect reservation.
      
      As a precaution, fall back to the original reservation algorithm if the
      new indlen requirement is not met and warn if we end up with extents
      without any reservation at all to detect this more easily in the future.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      d34999c9
    • B
      xfs: refactor delalloc indlen reservation split into helper · a9bd24ac
      Brian Foster 提交于
      The delayed allocation indirect reservation splitting code is not
      sufficient in some cases where a delalloc extent is split in two. In
      preparation for enhancements to this code, refactor the current indlen
      distribution algorithm into a new helper function.
      
      [dchinner: rename temp, temp2 variables]
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      a9bd24ac
    • B
      xfs: update freeblocks counter after extent deletion · b2706a05
      Brian Foster 提交于
      xfs_bunmapi() currently updates the fdblocks counter, unreserves quota,
      etc. before the extent is deleted by xfs_bmap_del_extent(). The function
      has problems dividing up the indirect reserved blocks for scenarios
      where a single delalloc extent is split in two. Particularly, there
      aren't always enough blocks reserved for multiple extents in a single
      extent reservation.
      
      The solution to this problem is to allow the extent removal code to
      steal from the deleted extent to meet indirect reservation requirements.
      Move the block of code in xfs_bmapi() that updates the fdblocks counter
      to after the call to xfs_bmap_del_extent() to allow the codepath to
      update the extent record before the free blocks are accounted. Also,
      reshuffle the code slightly so the delalloc accounting occurs near the
      xfs_bmap_del_extent() call to provide context for the comments.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      b2706a05
  13. 09 3月, 2016 1 次提交
  14. 02 3月, 2016 1 次提交
  15. 09 2月, 2016 1 次提交
  16. 11 1月, 2016 1 次提交
    • E
      xfs: eliminate committed arg from xfs_bmap_finish · f6106efa
      Eric Sandeen 提交于
      Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
      associated comments were replicated several times across
      the attribute code, all dealing with what to do if the
      transaction was or wasn't committed.
      
      And in that replicated code, an ASSERT() test of an
      uninitialized variable occurs in several locations:
      
      	error = xfs_attr_thing(&args);
      	if (!error) {
      		error = xfs_bmap_finish(&args.trans, args.flist,
      					&committed);
      	}
      	if (error) {
      		ASSERT(committed);
      
      If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
      never set "committed", and then test it in the ASSERT.
      
      Fix this up by moving the committed state internal to xfs_bmap_finish,
      and add a new inode argument.  If an inode is passed in, it is passed
      through to __xfs_trans_roll() and joined to the transaction there if
      the transaction was committed.
      
      xfs_qm_dqalloc() was a little unique in that it called bjoin rather
      than ijoin, but as Dave points out we can detect the committed state
      but checking whether (*tpp != tp).
      
      Addresses-Coverity-Id: 102360
      Addresses-Coverity-Id: 102361
      Addresses-Coverity-Id: 102363
      Addresses-Coverity-Id: 102364
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f6106efa