1. 08 11月, 2016 2 次提交
    • E
      xfs: provide helper for counting extents from if_bytes · 5d829300
      Eric Sandeen 提交于
      The open-coded pattern:
      
      ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)
      
      is all over the xfs code; provide a new helper
      xfs_iext_count(ifp) to count the number of inline extents
      in an inode fork.
      
      [dchinner: pick up several missed conversions]
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5d829300
    • E
      xfs: fix up xfs_swap_extent_forks inline extent handling · 4dfce57d
      Eric Sandeen 提交于
      There have been several reports over the years of NULL pointer
      dereferences in xfs_trans_log_inode during xfs_fsr processes,
      when the process is doing an fput and tearing down extents
      on the temporary inode, something like:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
          [exception RIP: xfs_trans_log_inode+0x10]
       #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
      #10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
      #11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
      #12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
      #13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
      #14 [ffff8800a57bbe00] evict at ffffffff811e1b67
      #15 [ffff8800a57bbe28] iput at ffffffff811e23a5
      #16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
      #17 [ffff8800a57bbe88] dput at ffffffff811dd06c
      #18 [ffff8800a57bbea8] __fput at ffffffff811c823b
      #19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
      #20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
      #21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
      #22 [ffff8800a57bbf50] int_signal at ffffffff8161405d
      
      As it turns out, this is because the i_itemp pointer, along
      with the d_ops pointer, has been overwritten with zeros
      when we tear down the extents during truncate.  When the in-core
      inode fork on the temporary inode used by xfs_fsr was originally
      set up during the extent swap, we mistakenly looked at di_nextents
      to determine whether all extents fit inline, but this misses extents
      generated by speculative preallocation; we should be using if_bytes
      instead.
      
      This mistake corrupts the in-memory inode, and code in
      xfs_iext_remove_inline eventually gets bad inputs, causing
      it to memmove and memset incorrect ranges; this became apparent
      because the two values in ifp->if_u2.if_inline_ext[1] contained
      what should have been in d_ops and i_itemp; they were memmoved due
      to incorrect array indexing and then the original locations
      were zeroed with memset, again due to an array overrun.
      
      Fix this by properly using i_df.if_bytes to determine the number
      of extents, not di_nextents.
      
      Thanks to dchinner for looking at this with me and spotting the
      root cause.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      4dfce57d
  2. 06 10月, 2016 7 次提交
  3. 26 9月, 2016 1 次提交
    • D
      xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner 提交于
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that allocation is
      metadata and reuse of busy extents is acceptible due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation is that has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, This patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      292378ed
  4. 03 8月, 2016 6 次提交
  5. 21 6月, 2016 3 次提交
  6. 01 6月, 2016 2 次提交
  7. 19 5月, 2016 1 次提交
  8. 06 4月, 2016 1 次提交
    • C
      xfs: better xfs_trans_alloc interface · 253f4911
      Christoph Hellwig 提交于
      Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
      that returns a transaction with all the required log and block reservations,
      and which allows passing transaction flags directly to avoid the cumbersome
      _xfs_trans_alloc interface.
      
      While we're at it we also get rid of the transaction type argument that has
      been superflous since we stopped supporting the non-CIL logging mode.  The
      guts of it will be removed in another patch.
      
      [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      253f4911
  9. 05 4月, 2016 1 次提交
    • K
      mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov 提交于
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time
      ago with promise that one day it will be possible to implement page
      cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And unlikely will.
      
      We have many places where PAGE_CACHE_SIZE assumed to be equal to
      PAGE_SIZE.  And it's constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is revert of changes to
      PAGE_CAHCE_ALIGN definition: we are going to drop it later.
      
      There are few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation also
      will be addressed with the separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09cbfeaf
  10. 28 2月, 2016 1 次提交
    • R
      dax: give DAX clearing code correct bdev · 20a90f58
      Ross Zwisler 提交于
      dax_clear_blocks() needs a valid struct block_device and previously it
      was using inode->i_sb->s_bdev in all cases.  This is correct for normal
      inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
      DAX raw block devices and for XFS real-time devices.
      
      Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
      its arguments to take a bdev and a sector instead of an inode and a
      block.  This better reflects what the function does, and it allows the
      filesystem and raw block device code to pass in an appropriate struct
      block_device.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      20a90f58
  11. 09 2月, 2016 1 次提交
  12. 08 2月, 2016 1 次提交
  13. 11 1月, 2016 1 次提交
    • E
      xfs: eliminate committed arg from xfs_bmap_finish · f6106efa
      Eric Sandeen 提交于
      Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
      associated comments were replicated several times across
      the attribute code, all dealing with what to do if the
      transaction was or wasn't committed.
      
      And in that replicated code, an ASSERT() test of an
      uninitialized variable occurs in several locations:
      
      	error = xfs_attr_thing(&args);
      	if (!error) {
      		error = xfs_bmap_finish(&args.trans, args.flist,
      					&committed);
      	}
      	if (error) {
      		ASSERT(committed);
      
      If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
      never set "committed", and then test it in the ASSERT.
      
      Fix this up by moving the committed state internal to xfs_bmap_finish,
      and add a new inode argument.  If an inode is passed in, it is passed
      through to __xfs_trans_roll() and joined to the transaction there if
      the transaction was committed.
      
      xfs_qm_dqalloc() was a little unique in that it called bjoin rather
      than ijoin, but as Dave points out we can detect the committed state
      but checking whether (*tpp != tp).
      
      Addresses-Coverity-Id: 102360
      Addresses-Coverity-Id: 102361
      Addresses-Coverity-Id: 102363
      Addresses-Coverity-Id: 102364
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f6106efa
  14. 03 11月, 2015 1 次提交
    • D
      xfs: introduce BMAPI_ZERO for allocating zeroed extents · 3fbbbea3
      Dave Chinner 提交于
      To enable DAX to do atomic allocation of zeroed extents, we need to
      drive the block zeroing deep into the allocator. Because
      xfs_bmapi_write() can return merged extents on allocation that were
      only partially allocated (i.e. requested range spans allocated and
      hole regions, allocation into the hole was contiguous), we cannot
      zero the extent returned from xfs_bmapi_write() as that can
      overwrite existing data with zeros.
      
      Hence we have to drive the extent zeroing into the allocation code,
      prior to where we merge the extents into the BMBT and return the
      resultant map. This means we need to propagate this need down to
      the xfs_alloc_vextent() and issue the block zeroing at this point.
      
      While this functionality is being introduced for DAX, there is no
      reason why it is specific to DAX - we can per-zero blocks during the
      allocation transaction on any type of device. It's just slow (and
      usually slower than unwritten allocation and conversion) on
      traditional block devices so doesn't tend to get used. We can,
      however, hook hardware zeroing optimisations via sb_issue_zeroout()
      to this operation, so it may be useful in future and hence the
      "allocate zeroed blocks" API needs to be implementation neutral.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3fbbbea3
  15. 12 10月, 2015 1 次提交
    • B
      xfs: pass total block res. as total xfs_bmapi_write() parameter · dbd5c8c9
      Brian Foster 提交于
      The total field from struct xfs_alloc_arg is a bit of an unknown
      commodity. It is documented as the total block requirement for the
      transaction and is used in this manner from most call sites by virtue of
      passing the total block reservation of the transaction associated with
      an allocation. Several xfs_bmapi_write() callers pass hardcoded values
      of 0 or 1 for the total block requirement, which is a historical oddity
      without any clear reasoning.
      
      The xfs_iomap_write_direct() caller, for example, passes 0 for the total
      block requirement. This has been determined to cause problems in the
      form of ABBA deadlocks of AGF buffers due to incorrect AG selection in
      the block allocator. Specifically, the xfs_alloc_space_available()
      function incorrectly selects an AG that doesn't actually have sufficient
      space for the allocation. This occurs because the args.total field is 0
      and thus the remaining free space check on the AG doesn't actually
      consider the size of the allocation request. This locks the AGF buffer,
      the allocation attempt proceeds and ultimately fails (in
      xfs_alloc_fix_minleft()), and xfs_alloc_vexent() moves on to the next
      AG. In turn, this can lead to incorrect AG locking order (if the
      allocator wraps around, attempting to lock AG 0 after acquiring AG N)
      and thus deadlock if racing with another operation. This problem has
      been reproduced via generic/299 on smallish (1GB) ramdisk test devices.
      
      To avoid this problem, replace the undocumented hardcoded total
      parameters from the iomap and utility callers to pass the block
      reservation used for the associated transaction. This is consistent with
      other xfs_bmapi_write() callers throughout XFS. The assumption is that
      the total field allows the selection of an AG that can handle the entire
      operation rather than simply the allocation/range being requested (e.g.,
      resulting btree splits, etc.). This addresses the aforementioned
      generic/299 hang by ensuring AG selection only occurs when the
      allocation can be satisfied by the AG.
      Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      dbd5c8c9
  16. 19 8月, 2015 3 次提交
    • B
      xfs: add missing bmap cancel calls in error paths · d4a97a04
      Brian Foster 提交于
      If a failure occurs after the bmap free list is populated and before
      xfs_bmap_finish() completes successfully (which returns a partial
      list on failure), the bmap free list must be cancelled. Otherwise,
      the extent items on the list are never freed and a memory leak
      occurs.
      
      Several random error paths throughout the code suffer this problem.
      Fix these up such that xfs_bmap_cancel() is always called on error.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      d4a97a04
    • B
      xfs: ensure EFD trans aborts on log recovery extent free failure · 6bc43af3
      Brian Foster 提交于
      Log recovery attempts to free extents with leftover EFIs in the AIL
      after initial processing. If the extent free fails (e.g., due to
      unrelated fs corruption), the transaction is cancelled, though it
      might not be dirtied at the time. If this is the case, the EFD does
      not abort and thus does not release the EFI. This can lead to hangs
      as the EFI pins the AIL.
      
      Update xlog_recover_process_efi() to log the EFD in the transaction
      before xfs_free_extent() errors are handled to ensure the
      transaction is dirty, aborts the EFD and releases the EFI on error.
      Since this is a requirement for EFD processing (and consistent with
      xfs_bmap_finish()), update the EFD logging helper to do the extent
      free and unconditionally log the EFD. This encodes the required EFD
      logging behavior into the helper and reduces the likelihood of
      errors down the road.
      
      [dchinner: re-add xfs_alloc.h to xfs_log_recover.c to fix build
       failure.]
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      6bc43af3
    • B
      xfs: fix efi/efd error handling to avoid fs shutdown hangs · 8d99fe92
      Brian Foster 提交于
      Freeing an extent in XFS involves logging an EFI (extent free
      intention), freeing the actual extent, and logging an EFD (extent
      free done). The EFI object is created with a reference count of 2:
      one for the current transaction and one for the subsequently created
      EFD. Under normal circumstances, the first reference is dropped when
      the EFI is unpinned and the second reference is dropped when the EFD
      is committed to the on-disk log.
      
      In event of errors or filesystem shutdown, there are various
      potential cleanup scenarios depending on the state of the EFI/EFD.
      The cleanup scenarios are confusing and racy, as demonstrated by the
      following test sequence:
      
      	# mount $dev $mnt
      	# fsstress -d $mnt -n 99999 -p 16 -z -f fallocate=1 \
      		-f punch=1 -f creat=1 -f unlink=1 &
      	# sleep 5
      	# killall -9 fsstress; wait
      	# godown -f $mnt
      	# umount
      
      ... in which the final umount can hang due to the AIL being pinned
      indefinitely by one or more EFI items. This can occur due to several
      conditions. For example, if the shutdown occurs after the EFI is
      committed to the on-disk log and the EFD committed to the CIL, but
      before the EFD committed to the log, the EFD iop_committed() abort
      handler does not drop its reference to the EFI. Alternatively,
      manual error injection in the xfs_bmap_finish() codepath shows that
      if an error occurs after the EFI transaction is committed but before
      the EFD is constructed and logged, the EFI is never released from
      the AIL.
      
      Update the EFI/EFD item handling code to use a more straightforward
      and reliable approach to error handling. If an error occurs after
      the EFI transaction is committed and before the EFD is constructed,
      release the EFI explicitly from xfs_bmap_finish(). If the EFI
      transaction is cancelled, release the EFI in the unlock handler.
      
      Once the EFD is constructed, it is responsible for releasing the EFI
      under any circumstances (including whether the EFI item aborts due
      to log I/O error). Update the EFD item handlers to release the EFI
      if the transaction is cancelled or aborts due to log I/O error.
      Finally, update xfs_bmap_finish() to log at least one EFD extent to
      the transaction before xfs_free_extent() errors are handled to
      ensure the transaction is dirty and EFD item error handling is
      triggered.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      8d99fe92
  17. 04 6月, 2015 4 次提交
  18. 13 4月, 2015 1 次提交
  19. 25 3月, 2015 1 次提交
  20. 23 2月, 2015 1 次提交