1. 17 May, 2017 2 commits
  2. 26 Apr, 2017 1 commit
    • xfs: more do_div cleanups · 4f1adf33
      Eric Sandeen committed
      On some architectures do_div does the pointer compare
      trick to make sure that we've sent it an unsigned 64-bit
      number.  (Why unsigned?  I don't know.)
      
      Fix up the few places that squawk about this; in
      xfs_bmap_wants_extents() we just used a bare int64_t so change
      that to unsigned.
      
      In xfs_adjust_extent_unmap_boundaries() all we wanted was the
      mod, and we have an xfs-specific function to handle that w/o
      side effects, which includes proper casting for do_div.
      
      In xfs_daddr_to_ag[b]no, we were using the wrong type anyway;
      XFS_BB_TO_FSBT returns a block in the filesystem, so use
      xfs_rfsblock_t not xfs_daddr_t, and gain the unsignedness
      from that type as a bonus.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
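A minimal user-space sketch of the "mod without side effects" idea from the commit above. The `sketch_` names are hypothetical stand-ins, not the kernel's do_div() or the xfs helper. It assumes only what the message states: do_div wants an unsigned 64-bit value, and a wrapper can do the cast on a local copy.

```c
#include <stdint.h>

/* Hypothetical stand-in for do_div(): divides *n in place and returns
 * the remainder. The real kernel macro is architecture-specific. */
static inline uint32_t sketch_do_div(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Mod without side effects: operates on an unsigned local copy, so the
 * caller's (possibly signed) value is never modified. */
static inline uint32_t sketch_mod_u64(int64_t value, uint32_t base)
{
	uint64_t tmp = (uint64_t)value;

	return sketch_do_div(&tmp, base);
}
```

The wrapper keeps the unsigned cast in one place, so callers that only want the remainder do not need to care about the type do_div expects.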
  3. 12 Apr, 2017 1 commit
  4. 09 Apr, 2017 1 commit
  5. 04 Apr, 2017 3 commits
  6. 18 Feb, 2017 1 commit
  7. 17 Feb, 2017 1 commit
    • xfs: don't reserve blocks for right shift transactions · 48af96ab
      Brian Foster committed
      The block reservation for the transaction allocated in
      xfs_shift_file_space() is an artifact of the original collapse range
      support. It exists to handle the case where a collapse range occurs,
      the initial extent is left shifted into a location that forms a
      contiguous boundary with the previous extent and thus the extents
      are merged. This code was subsequently refactored and reused for
      insert range (right shift) support.
      
      If an insert range occurs under low free space conditions, the
      extent at the starting offset is split before the first shift
      transaction is allocated. If the block reservation fails, this
      leaves separate, but contiguous extents around in the inode. While
      not a fatal problem, this is unexpected and will flag a warning on
      subsequent insert range operations on the inode. This problem has
      been reproduced intermittently by generic/270 running against a
      ramdisk device.
      
      Since right shift does not create new extent boundaries in the
      inode, a block reservation for extent merge is unnecessary. Update
      xfs_shift_file_space() to conditionally reserve fs blocks for left
      shift transactions only. This avoids the warning reproduced by
      generic/270.
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
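The conditional reservation described above can be sketched roughly as follows. The enum and function names are hypothetical, and the real xfs_shift_file_space() computes its reservation differently. The sketch only shows the decision: reserve blocks for a possible extent merge on left shift, and none on right shift.

```c
/* Hypothetical direction type for illustration only. */
enum sketch_shift_direction { SKETCH_SHIFT_LEFT, SKETCH_SHIFT_RIGHT };

/* Blocks to reserve for one shift transaction: only a left shift
 * (collapse range) can merge extents, so only it needs merge_resblks. */
static unsigned int sketch_shift_resblks(enum sketch_shift_direction dir,
					 unsigned int merge_resblks)
{
	return dir == SKETCH_SHIFT_LEFT ? merge_resblks : 0;
}
```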
  8. 31 Jan, 2017 3 commits
    • xfs: remove unused full argument from bmap · 1dbba086
      Eric Sandeen committed
      The "full" argument was used only by the fiemap formatter,
      which is now gone with the iomap updates.
      
      Remove the unused arg.
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
    • xfs: fix eofblocks race with file extending async dio writes · e4229d6b
      Brian Foster committed
      It's possible for post-eof blocks to end up being used for direct I/O
      writes. dio write performs an upfront unwritten extent allocation, sends
      the dio and then updates the inode size (if necessary) on write
      completion. If a file release occurs while a file extending dio write is
      in flight, it is possible to mistake the post-eof blocks for speculative
      preallocation and incorrectly truncate them from the inode. This means
      that the resulting dio write completion can discover a hole and allocate
      new blocks rather than perform unwritten extent conversion.
      
      This requires a strange mix of I/O and is thus not likely to reproduce
      in real world workloads. It is intermittently reproduced by generic/299.
      The error manifests as an assert failure due to transaction overrun
      because the aforementioned write completion transaction has only
      reserved enough blocks for btree operations:
      
        XFS: Assertion failed: tp->t_blk_res_used <= tp->t_blk_res, \
         file: fs/xfs//xfs_trans.c, line: 309
      
      The root cause is that xfs_free_eofblocks() uses i_size to truncate
      post-eof blocks from the inode, but async, file extending direct writes
      do not update i_size until write completion, long after inode locks are
      dropped. Therefore, xfs_free_eofblocks() effectively truncates the inode
      to the incorrect size.
      
      Update xfs_free_eofblocks() to serialize against dio similar to how
      extending writes are serialized against i_size updates before post-eof
      block zeroing. Specifically, wait on dio while under the iolock. This
      ensures that dio write completions have updated i_size before post-eof
      blocks are processed.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
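A simplified, single-threaded model of the race above. All names are hypothetical and the real code uses inode locks and a dio wait primitive rather than a field. It assumes that an in-flight extending write only updates i_size at completion, so trimming post-eof blocks before waiting cuts off blocks the write still needs.

```c
#include <stdint.h>

/* Hypothetical in-memory model of an inode, for illustration only. */
struct sketch_inode {
	uint64_t i_size;          /* size as updated at write completion */
	uint64_t alloc_end;       /* end of allocated blocks, in bytes */
	uint64_t pending_dio_end; /* end of in-flight dio write, 0 if none */
};

/* Models waiting for dio: the completion runs and updates i_size. */
static void sketch_dio_wait(struct sketch_inode *ip)
{
	if (ip->pending_dio_end > ip->i_size)
		ip->i_size = ip->pending_dio_end;
	ip->pending_dio_end = 0;
}

/* Trim blocks beyond i_size; wait_for_dio selects the fixed behavior. */
static void sketch_free_eofblocks(struct sketch_inode *ip, int wait_for_dio)
{
	if (wait_for_dio)
		sketch_dio_wait(ip);
	if (ip->alloc_end > ip->i_size)
		ip->alloc_end = ip->i_size;
}
```

Without the wait, the blocks backing the pending write are mistaken for speculative preallocation and removed. With it, i_size already covers them when the trim runs.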
    • xfs: pull up iolock from xfs_free_eofblocks() · a36b9261
      Brian Foster committed
      xfs_free_eofblocks() requires the IOLOCK_EXCL lock, but is called from
      different contexts where the lock may or may not be held. The
      need_iolock parameter exists for this reason, to indicate whether
      xfs_free_eofblocks() must acquire the iolock itself before it can
      proceed.
      
      This is ugly and confusing. Simplify the semantics of
      xfs_free_eofblocks() to require the caller to acquire the iolock
      appropriately and kill the need_iolock parameter. While here, the mp
      param can be removed as well as the xfs_mount is accessible from the
      xfs_inode structure. This patch does not change behavior.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  9. 27 Jan, 2017 1 commit
    • xfs: fix bmv_count confusion w/ shared extents · c364b6d0
      Darrick J. Wong committed
      In a bmapx call, bmv_count is the total size of the array, including the
      zeroth element that userspace uses to supply the search key.  The output
      array starts at offset 1 so that we can set up the user for the next
      invocation.  Since we now can split an extent into multiple bmap records
      due to shared/unshared status, we have to be careful that we don't
      overflow the output array.
      
      In the original patch f86f4037 ("xfs: teach get_bmapx about shared
      extents and the CoW fork") I used cur_ext (the output index) to check
      for overflows, albeit with an off-by-one error.  Since nexleft no longer
      describes the number of unfilled slots in the output, we can rip all
      that out and use cur_ext for the overflow check directly.
      
      Failure to do this causes heap corruption in bmapx callers such as
      xfs_io and xfs_scrub.  xfs/328 can reproduce this problem.
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
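The array layout and overflow check described above can be sketched as follows. The struct and function are hypothetical and are not the kernel's getbmapx code. The sketch assumes, as the message says, that slot 0 holds the search key, output starts at slot 1, and bmv_count is the total array size.

```c
/* Hypothetical record type for illustration only. */
struct sketch_rec {
	long long offset;
	long long length;
};

/* Copy records into out[1..bmv_count-1], leaving out[0] (the key) alone.
 * cur_ext is the output index; it is checked directly against the number
 * of usable output slots. Returns the number of records written. */
static int sketch_fill_bmap(struct sketch_rec *out, int bmv_count,
			    const struct sketch_rec *src, int nsrc)
{
	int cur_ext = 0;

	for (int i = 0; i < nsrc; i++) {
		if (cur_ext >= bmv_count - 1)
			break;	/* output array is full */
		out[cur_ext + 1] = src[i];
		cur_ext++;
	}
	return cur_ext;
}
```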
  10. 30 Nov, 2016 1 commit
  11. 08 Nov, 2016 2 commits
    • xfs: provide helper for counting extents from if_bytes · 5d829300
      Eric Sandeen committed
      The open-coded pattern:
      
      ifp->if_bytes / (uint)sizeof(xfs_bmbt_rec_t)
      
      is all over the xfs code; provide a new helper
      xfs_iext_count(ifp) to count the number of inline extents
      in an inode fork.
      
      [dchinner: pick up several missed conversions]
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
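A user-space sketch of what such a helper might look like. The types here are hypothetical stand-ins (the record size of two 64-bit words is an assumption for illustration). It simply wraps the open-coded division quoted in the message.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the extent record and inode fork types. */
struct sketch_bmbt_rec {
	uint64_t l0;
	uint64_t l1;
};

struct sketch_ifork {
	int if_bytes;	/* bytes of extent records held by the fork */
};

/* Number of extent records: bytes divided by the record size. */
static inline unsigned int sketch_iext_count(const struct sketch_ifork *ifp)
{
	return (unsigned int)ifp->if_bytes /
	       (unsigned int)sizeof(struct sketch_bmbt_rec);
}
```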
    • xfs: fix up xfs_swap_extent_forks inline extent handling · 4dfce57d
      Eric Sandeen committed
      There have been several reports over the years of NULL pointer
      dereferences in xfs_trans_log_inode during xfs_fsr processes,
      when the process is doing an fput and tearing down extents
      on the temporary inode, something like:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      PID: 29439  TASK: ffff880550584fa0  CPU: 6   COMMAND: "xfs_fsr"
          [exception RIP: xfs_trans_log_inode+0x10]
       #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs]
      #10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs]
      #11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs]
      #12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs]
      #13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs]
      #14 [ffff8800a57bbe00] evict at ffffffff811e1b67
      #15 [ffff8800a57bbe28] iput at ffffffff811e23a5
      #16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8
      #17 [ffff8800a57bbe88] dput at ffffffff811dd06c
      #18 [ffff8800a57bbea8] __fput at ffffffff811c823b
      #19 [ffff8800a57bbef0] ____fput at ffffffff811c846e
      #20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27
      #21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c
      #22 [ffff8800a57bbf50] int_signal at ffffffff8161405d
      
      As it turns out, this is because the i_itemp pointer, along
      with the d_ops pointer, has been overwritten with zeros
      when we tear down the extents during truncate.  When the in-core
      inode fork on the temporary inode used by xfs_fsr was originally
      set up during the extent swap, we mistakenly looked at di_nextents
      to determine whether all extents fit inline, but this misses extents
      generated by speculative preallocation; we should be using if_bytes
      instead.
      
      This mistake corrupts the in-memory inode, and code in
      xfs_iext_remove_inline eventually gets bad inputs, causing
      it to memmove and memset incorrect ranges; this became apparent
      because the two values in ifp->if_u2.if_inline_ext[1] contained
      what should have been in d_ops and i_itemp; they were memmoved due
      to incorrect array indexing and then the original locations
      were zeroed with memset, again due to an array overrun.
      
      Fix this by properly using i_df.if_bytes to determine the number
      of extents, not di_nextents.
      
      Thanks to dchinner for looking at this with me and spotting the
      root cause.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
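The difference between the two checks can be sketched like this. Everything here is hypothetical: the struct layout, the names, and the inline capacity of 2 are assumptions for illustration, not the kernel's definitions. The point is only that the on-disk extent count can be smaller than the number of in-core records when speculative preallocation has added extents.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INLINE_EXTS 2	/* assumed inline capacity */

struct sketch_bmbt_rec {
	uint64_t l0;
	uint64_t l1;
};

struct sketch_inode {
	unsigned int di_nextents;	/* on-disk extent count */
	size_t if_bytes;		/* bytes of in-core extent records */
};

/* Buggy check: misses in-core extents not reflected in di_nextents. */
static bool sketch_fits_inline_buggy(const struct sketch_inode *ip)
{
	return ip->di_nextents <= SKETCH_INLINE_EXTS;
}

/* Fixed check: counts the records actually held in memory. */
static bool sketch_fits_inline_fixed(const struct sketch_inode *ip)
{
	size_t count = ip->if_bytes / sizeof(struct sketch_bmbt_rec);

	return count <= SKETCH_INLINE_EXTS;
}
```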
  12. 06 Oct, 2016 7 commits
  13. 26 Sep, 2016 1 commit
    • xfs: remote attribute blocks aren't really userdata · 292378ed
      Dave Chinner committed
      When adding a new remote attribute, we write the attribute to the
      new extent before the allocation transaction is committed. This
      means we cannot reuse busy extents as that violates crash
      consistency semantics. Hence we currently treat remote attribute
      extent allocation like userdata because it has the same overwrite
      ordering constraints as userdata.
      
      Unfortunately, this also allows the allocator to incorrectly apply
      extent size hints to the remote attribute extent allocation. This
      results in interesting failures, such as transaction block
      reservation overruns and in-memory inode attribute fork corruption.
      
      To fix this, we need to separate the busy extent reuse configuration
      from the userdata configuration. This changes the definition of
      XFS_BMAPI_METADATA slightly - it now means that allocation is
      metadata and reuse of busy extents is acceptable due to the metadata
      ordering semantics of the journal. If this flag is not set, it
      means the allocation has unordered data writeback, and hence
      busy extent reuse is not allowed. It no longer implies the
      allocation is for user data, just that the data write will not be
      strictly ordered. This matches the semantics for both user data
      and remote attribute block allocation.
      
      As such, this patch changes the "userdata" field to a "datatype"
      field, and adds a "no busy reuse" flag to the field.
      When we detect an unordered data extent allocation, we immediately set
      the no reuse flag. We then set the "user data" flags based on the
      inode fork we are allocating the extent to. Hence we only set
      userdata flags on data fork allocations now and consider attribute
      fork remote extents to be an unordered metadata extent.
      
      The result is that remote attribute extents now have the expected
      allocation semantics, and the data fork allocation behaviour is
      completely unchanged.
      
      It should be noted that there may be other ways to fix this (e.g.
      use ordered metadata buffers for the remote attribute extent data
      write) but they are more invasive and difficult to validate both
      from a design and implementation POV. Hence this patch takes the
      simple, obvious route to fixing the problem...
      Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
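The flag derivation described above might be sketched as follows. The `SKETCH_` constants and the function are hypothetical and the bit values are arbitrary. Only the decision order follows the commit message: unordered data allocations get the no-reuse flag first, and the user data flag is then set only for the data fork.

```c
#include <stdbool.h>

/* Hypothetical flag values for illustration only. */
#define SKETCH_BMAPI_METADATA	(1u << 0)	/* request flag */
#define SKETCH_ALLOC_USERDATA	(1u << 0)	/* datatype flag */
#define SKETCH_ALLOC_NOBUSY	(1u << 1)	/* datatype flag */

/* Derive the allocator "datatype" from the request flags and the fork. */
static unsigned int sketch_datatype(unsigned int bmapi_flags, bool data_fork)
{
	unsigned int datatype = 0;

	if (!(bmapi_flags & SKETCH_BMAPI_METADATA)) {
		/* Unordered data writeback: never reuse busy extents. */
		datatype = SKETCH_ALLOC_NOBUSY;
		/* Only data fork allocations count as user data. */
		if (data_fork)
			datatype |= SKETCH_ALLOC_USERDATA;
	}
	return datatype;
}
```

Under this split, a remote attribute allocation ends up with the no-reuse flag alone, so extent size hints tied to user data would not apply to it.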
  14. 03 Aug, 2016 6 commits
  15. 21 Jun, 2016 3 commits
  16. 01 Jun, 2016 2 commits
  17. 19 May, 2016 1 commit
  18. 06 Apr, 2016 1 commit
    • xfs: better xfs_trans_alloc interface · 253f4911
      Christoph Hellwig committed
      Merge xfs_trans_reserve and xfs_trans_alloc into a single function call
      that returns a transaction with all the required log and block reservations,
      and which allows passing transaction flags directly to avoid the cumbersome
      _xfs_trans_alloc interface.
      
      While we're at it we also get rid of the transaction type argument that has
      been superfluous since we stopped supporting the non-CIL logging mode.  The
      guts of it will be removed in another patch.
      
      [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
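A user-space sketch of the "allocate and reserve in one call" shape described above. The types, names and the simple free-block counter are all hypothetical and this is not the xfs_trans_alloc signature. It only shows the design choice: the caller either gets a transaction with its reservation already in place, or an error and no transaction.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical mount and transaction types for illustration only. */
struct sketch_mount {
	unsigned int free_blocks;
};

struct sketch_trans {
	struct sketch_mount *mp;
	unsigned int log_res;
	unsigned int blk_res;
	unsigned int flags;
};

/* Allocate a transaction and take its block reservation in one step.
 * On failure *tpp is NULL and nothing has been reserved. */
static int sketch_trans_alloc(struct sketch_mount *mp, unsigned int log_res,
			      unsigned int blocks, unsigned int flags,
			      struct sketch_trans **tpp)
{
	struct sketch_trans *tp;

	*tpp = NULL;
	if (blocks > mp->free_blocks)
		return -ENOSPC;
	tp = calloc(1, sizeof(*tp));
	if (!tp)
		return -ENOMEM;
	mp->free_blocks -= blocks;
	tp->mp = mp;
	tp->log_res = log_res;
	tp->blk_res = blocks;
	tp->flags = flags;
	*tpp = tp;
	return 0;
}

/* Give the reservation back and free the transaction. */
static void sketch_trans_cancel(struct sketch_trans *tp)
{
	tp->mp->free_blocks += tp->blk_res;
	free(tp);
}
```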
  19. 05 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether a
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
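As an illustration of what code looks like after the conversion above, here is a small user-space sketch that computes a page index and an in-page offset with plain page constants. The `SKETCH_` macros and the 4 KiB page size are assumptions for the example. In the kernel the equivalent constants are PAGE_SHIFT and PAGE_SIZE.

```c
#include <stdint.h>

/* Assumed 4 KiB pages, for illustration only. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_SIZE	((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Page cache index of a byte position: formerly a PAGE_CACHE_SHIFT shift. */
static inline uint64_t sketch_page_index(uint64_t pos)
{
	return pos >> SKETCH_PAGE_SHIFT;
}

/* Offset of a byte position within its page. */
static inline uint64_t sketch_page_offset(uint64_t pos)
{
	return pos & (SKETCH_PAGE_SIZE - 1);
}
```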
  20. 28 Feb, 2016 1 commit
    • dax: give DAX clearing code correct bdev · 20a90f58
      Ross Zwisler committed
      dax_clear_blocks() needs a valid struct block_device and previously it
      was using inode->i_sb->s_bdev in all cases.  This is correct for normal
      inodes on mounted ext2, ext4 and XFS filesystems, but is incorrect for
      DAX raw block devices and for XFS real-time devices.
      
      Instead, rename dax_clear_blocks() to dax_clear_sectors(), and change
      its arguments to take a bdev and a sector instead of an inode and a
      block.  This better reflects what the function does, and it allows the
      filesystem and raw block device code to pass in an appropriate struct
      block_device.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
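Moving from an (inode, block) interface to a (bdev, sector) interface means callers convert filesystem blocks to sectors themselves. A minimal sketch of that conversion, assuming 512-byte sectors and a block size of at least 512 bytes; the helper name is hypothetical and not part of the DAX API.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* assumed 512-byte sectors */

/* Convert a filesystem block number to a sector number.
 * blkbits is log2 of the block size and must be >= SKETCH_SECTOR_SHIFT. */
static inline uint64_t sketch_block_to_sector(uint64_t block,
					      unsigned int blkbits)
{
	return block << (blkbits - SKETCH_SECTOR_SHIFT);
}
```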