1. 26 Sep 2016, 1 commit
    • xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner committed
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse
      configuration from the userdata configuration. This changes the
      definition of XFS_BMAPI_METADATA slightly - it now means that the
      allocation is metadata and reuse of busy extents is acceptable
      due to the metadata ordering semantics of the journal. If this
      flag is not set, the allocation has unordered data writeback, and
      hence busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, this patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be unordered metadata extents.
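
      A minimal sketch of the flag split described above (the flag names
      and values here are illustrative, not the exact kernel definitions):

              /* illustrative "datatype" flag bits for an allocation */
              #define ALLOC_USERDATA  (1 << 0) /* allocation is user data */
              #define ALLOC_NOBUSY    (1 << 1) /* busy reuse not allowed */

              if (!(flags & XFS_BMAPI_METADATA)) {
                      datatype = ALLOC_NOBUSY;  /* unordered data write */
                      if (whichfork == XFS_DATA_FORK)
                              datatype |= ALLOC_USERDATA;
              }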
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 03 Aug 2016, 6 commits
  3. 21 Jun 2016, 3 commits
  4. 01 Jun 2016, 2 commits
  5. 19 May 2016, 1 commit
  6. 06 Apr 2016, 1 commit
    • xfs: better xfs_trans_alloc interface · 253f4911
      Christoph Hellwig committed
      Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
      that returns a transaction with all the required log and block reservations,
      and which allows passing transaction flags directly to avoid the cumbersome
      _xfs_trans_alloc interface.
      
      While we're at it we also get rid of the transaction type argument that has
      been superfluous since we stopped supporting the non-CIL logging mode.  The
      guts of it will be removed in another patch.
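
      As a hedged sketch, the caller-side convention changes roughly like
      this (the reservation shown is just an example):

              /* before: allocate, reserve, and cancel manually on failure */
              tp = xfs_trans_alloc(mp, XFS_TRANS_SETATTR_NOT_SIZE);
              error = xfs_trans_reserve(tp, &M_RES(mp)->tr_ichange, 0, 0);
              if (error) {
                      xfs_trans_cancel(tp);
                      return error;
              }

              /* after: one call returns a fully reserved transaction */
              error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ichange, 0, 0, 0, &tp);
              if (error)
                      return error;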
      
      [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  7. 05 Apr 2016, 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long*
      time ago with the promise that one day it would be possible to
      implement the page cache with bigger chunks than PAGE_SIZE.

      This promise never materialized, and it is unlikely it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal
      to PAGE_SIZE, and it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constants should be used in a particular
      case, especially on the border between fs and mm.

      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too
      much breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straightforward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to
      the PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.
      I'll fix them manually in a separate patch.  Comments and
      documentation will also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
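
      Applied to a typical filesystem fragment, the transformation reads
      like this (hypothetical before/after):

              /* before */
              offset = pos & ~PAGE_CACHE_MASK;
              page_cache_release(page);

              /* after */
              offset = pos & ~PAGE_MASK;
              put_page(page);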
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 28 Feb 2016, 1 commit
    • dax: give DAX clearing code correct bdev · 20a90f58
      Ross Zwisler committed
      dax_clear_blocks() needs a valid struct block_device and previously it
      was using inode->i_sb->s_bdev in all cases.  This is correct for normal
      inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
      DAX raw block devices and for XFS real-time devices.
      
      Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
      its arguments to take a bdev and a sector instead of an inode and a
      block.  This better reflects what the function does, and it allows the
      filesystem and raw block device code to pass in an appropriate struct
      block_device.
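
      A sketch of the interface change (argument names are illustrative):

              /* before: bdev derived from the inode; wrong for raw block
               * devices and XFS real-time devices */
              int dax_clear_blocks(struct inode *inode, sector_t block,
                                   long size);

              /* after: the caller passes the correct device explicitly */
              int dax_clear_sectors(struct block_device *bdev,
                                    sector_t sector, long size);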
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 09 Feb 2016, 1 commit
  10. 08 Feb 2016, 1 commit
  11. 11 Jan 2016, 1 commit
    • xfs: eliminate committed arg from xfs_bmap_finish · f6106efa
      Eric Sandeen committed
      Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
      associated comments were replicated several times across
      the attribute code, all dealing with what to do if the
      transaction was or wasn't committed.
      
      And in that replicated code, an ASSERT() test of an
      uninitialized variable occurs in several locations:
      
      	error = xfs_attr_thing(&args);
      	if (!error) {
      		error = xfs_bmap_finish(&args.trans, args.flist,
      					&committed);
      	}
      	if (error) {
      		ASSERT(committed);
      
      If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
      never set "committed", and then test it in the ASSERT.
      
      Fix this up by moving the committed state internal to xfs_bmap_finish
      and adding a new inode argument.  If an inode is passed in, it is passed
      through to __xfs_trans_roll() and joined to the transaction there if
      the transaction was committed.
      
      xfs_qm_dqalloc() was a little unique in that it called bjoin rather
      than ijoin, but as Dave points out we can detect the committed state
      by checking whether (*tpp != tp).
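
      A hedged sketch of that detection in the dquot allocation path
      (details simplified):

              tp = *tpp;
              error = xfs_bmap_finish(tpp, &flist, NULL); /* no inode to join */
              if (error)
                      goto error1;
              if (*tpp != tp)         /* rolled: a commit happened */
                      xfs_trans_bjoin(*tpp, bp);
              else
                      xfs_trans_bhold_release(*tpp, bp);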
      
      Addresses-Coverity-Id: 102360
      Addresses-Coverity-Id: 102361
      Addresses-Coverity-Id: 102363
      Addresses-Coverity-Id: 102364
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  12. 03 Nov 2015, 1 commit
    • xfs: introduce BMAPI_ZERO for allocating zeroed extents · 3fbbbea3
      Dave Chinner committed
      To enable DAX to do atomic allocation of zeroed extents, we need to
      drive the block zeroing deep into the allocator. Because
      xfs_bmapi_write() can return merged extents on allocation that were
      only partially allocated (i.e. requested range spans allocated and
      hole regions, allocation into the hole was contiguous), we cannot
      zero the extent returned from xfs_bmapi_write() as that can
      overwrite existing data with zeros.
      
      Hence we have to drive the extent zeroing into the allocation code,
      prior to where we merge the extents into the BMBT and return the
      resultant map. This means we need to propagate this need down to
      xfs_alloc_vextent() and issue the block zeroing at this point.
      
      While this functionality is being introduced for DAX, there is no
      reason why it is specific to DAX - we can zero blocks during the
      allocation transaction on any type of device. It's just slow (and
      usually slower than unwritten allocation and conversion) on
      traditional block devices so doesn't tend to get used. We can,
      however, hook hardware zeroing optimisations via sb_issue_zeroout()
      to this operation, so it may be useful in future and hence the
      "allocate zeroed blocks" API needs to be implementation neutral.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  13. 12 Oct 2015, 1 commit
    • xfs: pass total block res. as total xfs_bmapi_write() parameter · dbd5c8c9
      Brian Foster committed
      The total field from struct xfs_alloc_arg is a bit of an unknown
      commodity. It is documented as the total block requirement for the
      transaction and is used in this manner from most call sites by virtue of
      passing the total block reservation of the transaction associated with
      an allocation. Several xfs_bmapi_write() callers pass hardcoded values
      of 0 or 1 for the total block requirement, which is a historical oddity
      without any clear reasoning.
      
      The xfs_iomap_write_direct() caller, for example, passes 0 for the total
      block requirement. This has been determined to cause problems in the
      form of ABBA deadlocks of AGF buffers due to incorrect AG selection in
      the block allocator. Specifically, the xfs_alloc_space_available()
      function incorrectly selects an AG that doesn't actually have sufficient
      space for the allocation. This occurs because the args.total field is 0
      and thus the remaining free space check on the AG doesn't actually
      consider the size of the allocation request. This locks the AGF buffer,
      the allocation attempt proceeds and ultimately fails (in
      xfs_alloc_fix_minleft()), and xfs_alloc_vextent() moves on to the next
      AG. In turn, this can lead to incorrect AG locking order (if the
      allocator wraps around, attempting to lock AG 0 after acquiring AG N)
      and thus deadlock if racing with another operation. This problem has
      been reproduced via generic/299 on smallish (1GB) ramdisk test devices.
      
      To avoid this problem, replace the undocumented hardcoded total
      parameters in the iomap and utility callers with the block
      reservation used for the associated transaction. This is consistent with
      other xfs_bmapi_write() callers throughout XFS. The assumption is that
      the total field allows the selection of an AG that can handle the entire
      operation rather than simply the allocation/range being requested (e.g.,
      resulting btree splits, etc.). This addresses the aforementioned
      generic/299 hang by ensuring AG selection only occurs when the
      allocation can be satisfied by the AG.
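
      In caller terms the fix is small; a hedged sketch of the
      xfs_iomap_write_direct() call after the change:

              /* pass the transaction's block reservation, not a hardcoded
               * 0, so AG selection accounts for the whole operation */
              error = xfs_bmapi_write(tp, ip, offset_fsb, count_fsb,
                                      bmapi_flags, &firstfsb, resblks,
                                      imap, &nimaps, &free_list);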
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  14. 19 Aug 2015, 3 commits
    • xfs: add missing bmap cancel calls in error paths · d4a97a04
      Brian Foster committed
      If a failure occurs after the bmap free list is populated and before
      xfs_bmap_finish() completes successfully (which returns a partial
      list on failure), the bmap free list must be cancelled. Otherwise,
      the extent items on the list are never freed and a memory leak
      occurs.
      
      Several random error paths throughout the code suffer this problem.
      Fix these up such that xfs_bmap_cancel() is always called on error.
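
      The repaired error paths follow a common shape; a minimal sketch
      (labels and surrounding code are illustrative):

              error = xfs_bmap_finish(&tp, &free_list, &committed);
              if (error)
                      goto out_bmap_cancel;   /* partial list on failure */
              return xfs_trans_commit(tp);

      out_bmap_cancel:
              /* free leftover extent items so they are not leaked */
              xfs_bmap_cancel(&free_list);
              xfs_trans_cancel(tp);
              return error;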
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: ensure EFD trans aborts on log recovery extent free failure · 6bc43af3
      Brian Foster committed
      Log recovery attempts to free extents with leftover EFIs in the AIL
      after initial processing. If the extent free fails (e.g., due to
      unrelated fs corruption), the transaction is cancelled, though it
      might not be dirtied at the time. If this is the case, the EFD does
      not abort and thus does not release the EFI. This can lead to hangs
      as the EFI pins the AIL.
      
      Update xlog_recover_process_efi() to log the EFD in the transaction
      before xfs_free_extent() errors are handled, ensuring that the
      transaction is dirty and that cancellation aborts the EFD and
      releases the EFI on error.
      Since this is a requirement for EFD processing (and consistent with
      xfs_bmap_finish()), update the EFD logging helper to do the extent
      free and unconditionally log the EFD. This encodes the required EFD
      logging behavior into the helper and reduces the likelihood of
      errors down the road.
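
      A hedged sketch of the recovery loop after this change (simplified
      from the actual code):

              /* the helper frees the extent and logs the EFD
               * unconditionally, so the transaction is always dirty */
              error = xfs_trans_free_extent(tp, efdp, extp->ext_start,
                                            extp->ext_len);
              if (error)
                      goto abort_error;  /* abort releases the EFD and EFI */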
      
      [dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
       failure.]
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: fix efi/efd error handling to avoid fs shutdown hangs · 8d99fe92
      Brian Foster committed
      Freeing an extent in XFS involves logging an EFI (extent free
      intention), freeing the actual extent, and logging an EFD (extent
      free done). The EFI object is created with a reference count of 2:
      one for the current transaction and one for the subsequently created
      EFD. Under normal circumstances, the first reference is dropped when
      the EFI is unpinned and the second reference is dropped when the EFD
      is committed to the on-disk log.
      
      In event of errors or filesystem shutdown, there are various
      potential cleanup scenarios depending on the state of the EFI/EFD.
      The cleanup scenarios are confusing and racy, as demonstrated by the
      following test sequence:
      
      	# mount $dev $mnt
      	# fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \
      		-f punch=1 -f creat=1 -f unlink=1 &
      	# sleep 5
      	# killall -9 fsstress; wait
      	# godown -f $mnt
      	# umount
      
      ... in which the final umount can hang due to the AIL being pinned
      indefinitely by one or more EFI items. This can occur due to several
      conditions. For example, if the shutdown occurs after the EFI is
      committed to the on-disk log and the EFD is committed to the CIL, but
      before the EFD is committed to the log, the EFD iop_committed() abort
      handler does not drop its reference to the EFI. Alternatively,
      manual error injection in the xfs_bmap_finish() codepath shows that
      if an error occurs after the EFI transaction is committed but before
      the EFD is constructed and logged, the EFI is never released from
      the AIL.
      
      Update the EFI/EFD item handling code to use a more straightforward
      and reliable approach to error handling. If an error occurs after
      the EFI transaction is committed and before the EFD is constructed,
      release the EFI explicitly from xfs_bmap_finish(). If the EFI
      transaction is cancelled, release the EFI in the unlock handler.
      
      Once the EFD is constructed, it is responsible for releasing the EFI
      under any circumstances (including whether the EFI item aborts due
      to log I/O error). Update the EFD item handlers to release the EFI
      if the transaction is cancelled or aborts due to log I/O error.
      Finally, update xfs_bmap_finish() to log at least one EFD extent to
      the transaction before xfs_free_extent() errors are handled to
      ensure the transaction is dirty and EFD item error handling is
      triggered.
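
      A rough sketch of the xfs_bmap_finish() side of this (heavily
      simplified; error-path details such as the shutdown call are omitted):

              efi = xfs_trans_get_efi(*tp, flist->xbf_count);
              /* ... log the EFI extents, then roll the transaction ... */
              error = xfs_trans_roll(tp, NULL);
              if (error) {
                      /* EFD not constructed yet: drop its EFI reference */
                      xfs_efi_release(efi);
                      return error;
              }
              efd = xfs_trans_get_efd(*tp, efi, flist->xbf_count);
              /* from here the EFD handlers own releasing the EFI */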
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  15. 04 Jun 2015, 4 commits
  16. 13 Apr 2015, 1 commit
  17. 25 Mar 2015, 1 commit
  18. 23 Feb 2015, 1 commit
  19. 28 Nov 2014, 3 commits
  20. 30 Oct 2014, 1 commit
    • xfs: rework zero range to prevent invalid i_size updates · 5d11fb4b
      Brian Foster committed
      The zero range operation is analogous to fallocate with the exception of
      converting the range to zeroes. E.g., it attempts to allocate zeroed
      blocks over the range specified by the caller. The XFS implementation
      kills all delalloc blocks currently over the aligned range, converts the
      range to allocated zero blocks (unwritten extents) and handles the
      partial pages at the ends of the range by sending writes through the
      pagecache.
      
      The current implementation suffers from several problems associated with
      inode size. If the aligned range covers an extending I/O, said I/O is
      discarded and an inode size update from a previous write never makes it
      to disk. Further, if an unaligned zero range extends beyond eof, the
      page write induced for the partial end page can itself increase the
      inode size, even if the zero range request is not supposed to update
      i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
      
      The latter behavior not only incorrectly increases the inode size, but
      can lead to stray delalloc blocks on the inode. Typically, post-eof
      preallocation blocks are either truncated on release or inode eviction
      or explicitly written to by xfs_zero_eof() on natural file size
      extension. If the inode size increases due to zero range, however,
      associated blocks leak into the address space having never been
      converted or mapped to pagecache pages. A direct I/O to such an
      uncovered range cannot convert the extent via writeback and will BUG().
      For example:
      
      $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
      ...
      $ xfs_io -d -c "pread 128k 128k" <file>
      <BUG>
      
      If the entire delalloc extent happens to not have page coverage
      whatsoever (e.g., delalloc conversion couldn't find a large enough free
      space extent), even a full file writeback won't convert what's left of
      the extent and we'll assert on inode eviction.
      
      Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
      Use the existing hole punch and prealloc mechanisms as primitives for
      zero range. This implementation is neither efficient nor ideal, as we
      write back dirty data over the range and remove existing extents rather
      than convert to unwritten. The former writeback, however, is currently
      the only mechanism available to ensure consistency between pagecache and
      extent state. Even a pagecache truncate/delalloc punch prior to hole
      punch has led to inconsistencies due to racing with writeback.
      
      This provides a consistent, correct implementation of zero range that
      survives fsstress/fsx testing without assert failures. The
      implementation can be optimized from this point forward once the
      fundamental issue of pagecache and delalloc extent state consistency is
      addressed.
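
      A hedged sketch of the primitive composition described above
      (rounding and error handling trimmed; blksize is the fs block size):

              /* punch the range, then preallocate it as unwritten extents */
              error = xfs_free_file_space(ip, offset, len);
              if (error)
                      return error;

              error = xfs_alloc_file_space(ip, round_down(offset, blksize),
                                           round_up(offset + len, blksize) -
                                           round_down(offset, blksize),
                                           XFS_BMAPI_PREALLOC);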
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  21. 02 Oct 2014, 3 commits
    • xfs: flush the range before zero range conversion · da5f1096
      Brian Foster committed
      XFS currently discards delalloc blocks within the target range of a
      zero range request. Unaligned start and end offsets are zeroed
      through the page cache and the internal, aligned blocks are
      converted to unwritten extents.
      
      If EOF is page aligned and covered by a delayed allocation extent,
      the inode size is not updated until I/O completion. If a zero range
      request discards a delalloc range that covers page aligned EOF as
      such, the inode size update never occurs. For example:
      
      $ rm -f /mnt/file
      $ xfs_io -fc "pwrite 0 64k" -c "zero 60k 4k" /mnt/file
      $ stat -c "%s" /mnt/file
      65536
      $ umount /mnt
      $ mount <dev> /mnt
      $ stat -c "%s" /mnt/file
      61440
      
      Update xfs_zero_file_space() to flush the range rather than discard
      delalloc blocks to ensure that inode size updates occur
      appropriately.
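
      The fix amounts to writing the range back before conversion; a
      minimal sketch (simplified):

              /* flush dirty pagecache so any pending i_size update is
               * logged before delalloc blocks are converted */
              error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
                                                   offset, offset + len - 1);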
      
      [dchinner: Note that this is really a workaround to avoid the
      underlying problems. More work is needed (and ongoing) to fix those
      issues so this fix is being added as a temporary stop-gap measure. ]
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: simplify xfs_zero_remaining_bytes · 8c156125
      Christoph Hellwig committed
      xfs_zero_remaining_bytes() open codes a lot of buffer manipulations
      to do a read followed by a write. It can simply be replaced by an
      uncached read followed by an xfs_bwrite() call.
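
      A hedged sketch of the simplified read-modify-write (error handling
      trimmed; the 2014-era buffer API is assumed, and the offset/length
      variables are illustrative):

              /* uncached read, zero the sub-block range, write back */
              bp = xfs_buf_read_uncached(target, sector, numblks, 0, NULL);
              if (!bp)
                      return -ENOMEM;
              memset(bp->b_addr + boff, 0, bytes_to_zero);
              error = xfs_bwrite(bp);
              xfs_buf_relse(bp);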
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • xfs: introduce xfs_buf_submit[_wait] · 595bff75
      Dave Chinner committed
      There is a lot of cookie-cutter code that looks like:
      
      	if (shutdown)
      		handle buffer error
      	xfs_buf_iorequest(bp)
      	error = xfs_buf_iowait(bp)
      	if (error)
      		handle buffer error
      
      spread through XFS. There's significant complexity now in
      xfs_buf_iorequest() to specifically handle this sort of synchronous
      IO pattern, but there's all sorts of nasty surprises in different
      error handling code dependent on who owns the buffer references and
      the locks.
      
      Pull this pattern into a single helper, where we can hide all the
      synchronous IO warts and hence make the error handling for all the
      callers much saner. This removes the need for a special extra
      reference to protect IO completion processing, as we can now hold a
      single reference across dispatch and waiting, simplifying the sync
      IO semantics and error handling.
      
      In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
      make it explicitly handle only asynchronous IO. This forces all users
      to be switched specifically to one interface or the other and
      removes any ambiguity between how the interfaces are to be used. It
      also means that xfs_buf_iowait() goes away.
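
      A hedged sketch of the resulting call patterns:

              /* async: fire and forget; completion handling owns bp */
              xfs_buf_submit(bp);

              /* sync: one reference held across dispatch and wait */
              error = xfs_buf_submit_wait(bp);
              if (error)
                      xfs_buf_relse(bp);      /* illustrative disposition */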
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  22. 23 Sep 2014, 2 commits