1. 09 2月, 2016 7 次提交
  2. 11 1月, 2016 1 次提交
    • E
      xfs: eliminate committed arg from xfs_bmap_finish · f6106efa
      Eric Sandeen 提交于
      Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the
      associated comments were replicated several times across
      the attribute code, all dealing with what to do if the
      transaction was or wasn't committed.
      
      And in that replicated code, an ASSERT() test of an
      uninitialized variable occurs in several locations:
      
      	error = xfs_attr_thing(&args);
      	if (!error) {
      		error = xfs_bmap_finish(&args.trans, args.flist,
      					&committed);
      	}
      	if (error) {
      		ASSERT(committed);
      
      If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish,
      never set "committed", and then test it in the ASSERT.
      
      Fix this up by moving the committed state internal to xfs_bmap_finish,
      and add a new inode argument.  If an inode is passed in, it is passed
      through to __xfs_trans_roll() and joined to the transaction there if
      the transaction was committed.
      
      xfs_qm_dqalloc() was a little unique in that it called bjoin rather
      than ijoin, but as Dave points out we can detect the committed state
      but checking whether (*tpp != tp).
      
      Addresses-Coverity-Id: 102360
      Addresses-Coverity-Id: 102361
      Addresses-Coverity-Id: 102363
      Addresses-Coverity-Id: 102364
      Signed-off-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      f6106efa
  3. 04 1月, 2016 2 次提交
    • D
      xfs: introduce per-inode DAX enablement · 58f88ca2
      Dave Chinner 提交于
      Rather than just being able to turn DAX on and off via a mount
      option, some applications may only want to enable DAX for certain
      performance critical files in a filesystem.
      
      This patch introduces a new inode flag to enable DAX in the v3 inode
      di_flags2 field. It adds support for setting and clearing flags in
      the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the
      S_DAX inode flag appropriately when it is seen.
      
      When this flag is set on a directory, it acts as an "inherit flag".
      That is, inodes created in the directory will automatically inherit
      the on-disk inode DAX flag, enabling administrators to set up
      directory heirarchies that automatically use DAX. Setting this flag
      on an empty root directory will make the entire filesystem use DAX
      by default.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      58f88ca2
    • D
      xfs: use FS_XFLAG definitions directly · e7b89481
      Dave Chinner 提交于
      Now that the ioctls have been hoisted up to the VFS level, use
      the VFs definitions directly and remove the XFS specific definitions
      completely. Userspace is going to have to handle the change of this
      interface separately, so removing the definitions from xfs_fs.h is
      not an issue here at all.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      e7b89481
  4. 03 11月, 2015 1 次提交
    • D
      xfs: optimise away log forces on timestamp updates for fdatasync · fc0561ce
      Dave Chinner 提交于
      xfs: timestamp updates cause excessive fdatasync log traffic
      
      Sage Weil reported that a ceph test workload was writing to the
      log on every fdatasync during an overwrite workload. Event tracing
      showed that the only metadata modification being made was the
      timestamp updates during the write(2) syscall, but fdatasync(2)
      is supposed to ignore them. The key observation was that the
      transactions in the log all looked like this:
      
      INODE: #regs: 4   ino: 0x8b  flags: 0x45   dsize: 32
      
      And contained a flags field of 0x45 or 0x85, and had data and
      attribute forks following the inode core. This means that the
      timestamp updates were triggering dirty relogging of previously
      logged parts of the inode that hadn't yet been flushed back to
      disk.
      
      There are two parts to this problem. The first is that XFS relogs
      dirty regions in subsequent transactions, so it carries around the
      fields that have been dirtied since the last time the inode was
      written back to disk, not since the last time the inode was forced
      into the log.
      
      The second part is that on v5 filesystems, the inode change count
      update during inode dirtying also sets the XFS_ILOG_CORE flag, so
      on v5 filesystems this makes a timestamp update dirty the entire
      inode.
      
      As a result when fdatasync is run, it looks at the dirty fields in
      the inode, and sees more than just the timestamp flag, even though
      the only metadata change since the last fdatasync was just the
      timestamps. Hence we force the log on every subsequent fdatasync
      even though it is not needed.
      
      To fix this, add a new field to the inode log item that tracks
      changes since the last time fsync/fdatasync forced the log to flush
      the changes to the journal. This flag is updated when we dirty the
      inode, but we do it before updating the change count so it does not
      carry the "core dirty" flag from timestamp updates. The fields are
      zeroed when the inode is marked clean (due to writeback/freeing) or
      when an fsync/datasync forces the log. Hence if we only dirty the
      timestamps on the inode between fsync/fdatasync calls, the fdatasync
      will not trigger another log force.
      
      Over 100 runs of the test program:
      
      Ext4 baseline:
      	runtime: 1.63s +/- 0.24s
      	avg lat: 1.59ms +/- 0.24ms
      	iops: ~2000
      
      XFS, vanilla kernel:
              runtime: 2.45s +/- 0.18s
      	avg lat: 2.39ms +/- 0.18ms
      	log forces: ~400/s
      	iops: ~1000
      
      XFS, patched kernel:
              runtime: 1.49s +/- 0.26s
      	avg lat: 1.46ms +/- 0.25ms
      	log forces: ~30/s
      	iops: ~1500
      Reported-by: NSage Weil <sage@redhat.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      fc0561ce
  5. 12 10月, 2015 1 次提交
    • B
      xfs: per-filesystem stats counter implementation · ff6d6af2
      Bill O'Donnell 提交于
      This patch modifies the stats counting macros and the callers
      to those macros to properly increment, decrement, and add-to
      the xfs stats counts. The counts for global and per-fs stats
      are correctly advanced, and cleared by writing a "1" to the
      corresponding clear file.
      
      global counts: /sys/fs/xfs/stats/stats
      per-fs counts: /sys/fs/xfs/sda*/stats/stats
      
      global clear:  /sys/fs/xfs/stats/stats_clear
      per-fs clear:  /sys/fs/xfs/sda*/stats/stats_clear
      
      [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
       stats structures and macros. ]
      Signed-off-by: NBill O'Donnell <billodo@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      ff6d6af2
  6. 25 8月, 2015 1 次提交
  7. 20 8月, 2015 1 次提交
  8. 19 8月, 2015 4 次提交
    • D
      xfs: stop holding ILOCK over filldir callbacks · dbad7c99
      Dave Chinner 提交于
      The recent change to the readdir locking made in 40194ecc ("xfs:
      reinstate the ilock in xfs_readdir") for CXFS directory sanity was
      probably the wrong thing to do. Deep in the readdir code we
      can take page faults in the filldir callback, and so taking a page
      fault while holding an inode ilock creates a new set of locking
      issues that lockdep warns all over the place about.
      
      The locking order for regular inodes w.r.t. page faults is io_lock
      -> pagefault -> mmap_sem -> ilock. The directory readdir code now
      triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
      at this point, it inverts all the locking patterns that lockdep
      normally sees on XFS inodes, and so triggers lockdep. We worked
      around this with commit 93a8614e ("xfs: fix directory inode iolock
      lockdep false positive"), but that then just moved the lockdep
      warning to deeper in the page fault path and triggered on security
      inode locks. Fixing the shmem issue there just moved the lockdep
      reports somewhere else, and now we are getting false positives from
      filesystem freezing annotations getting confused.
      
      Further, if we enter memory reclaim in a readdir path, we now get
      lockdep warning about potential deadlocks because the ilock is held
      when we enter reclaim. This, again, is different to a regular file
      in that we never allow memory reclaim to run while holding the ilock
      for regular files. Hence lockdep now throws
      ilock->kmalloc->reclaim->ilock warnings.
      
      Basically, the problem is that the ilock is being used to protect
      the directory data and the inode metadata, whereas for a regular
      file the iolock protects the data and the ilock protects the
      metadata. From the VFS perspective, the i_mutex serialises all
      accesses to the directory data, and so not holding the ilock for
      readdir doesn't matter. The issue is that CXFS doesn't access
      directory data via the VFS, so it has no "data serialisaton"
      mechanism. Hence we need to hold the IOLOCK in the correct places to
      provide this low level directory data access serialisation.
      
      The ilock can then be used just when the extent list needs to be
      read, just like we do for regular files. The directory modification
      code can take the iolock exclusive when the ilock is also taken,
      and this then ensures that readdir is correct excluded while
      modifications are in progress.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      dbad7c99
    • D
      xfs: clean up inode lockdep annotations · 0952c818
      Dave Chinner 提交于
      Lockdep annotations are a maintenance nightmare. Locking has to be
      modified to suit the limitations of the annotations, and we're
      always having to fix the annotations because they are unable to
      express the complexity of locking heirarchies correctly.
      
      So, next up, we've got more issues with lockdep annotations for
      inode locking w.r.t. XFS_LOCK_PARENT:
      
      	- lockdep classes are exclusive and can't be ORed together
      	  to form new classes.
      	- IOLOCK needs multiple PARENT subclasses to express the
      	  changes needed for the readdir locking rework needed to
      	  stop the endless flow of lockdep false positives involving
      	  readdir calling filldir under the ILOCK.
      	- there are only 8 unique lockdep subclasses available,
      	  so we can't create a generic solution.
      
      IOWs we need to treat the 3-bit space available to each lock type
      differently:
      
      	- IOLOCK uses xfs_lock_two_inodes(), so needs:
      		- at least 2 IOLOCK subclasses
      		- at least 2 IOLOCK_PARENT subclasses
      	- MMAPLOCK uses xfs_lock_two_inodes(), so needs:
      		- at least 2 MMAPLOCK subclasses
      	- ILOCK uses xfs_lock_inodes with up to 5 inodes, so needs:
      		- at least 5 ILOCK subclasses
      		- one ILOCK_PARENT subclass
      		- one RTBITMAP subclass
      		- one RTSUM subclass
      
      For the IOLOCK, split the space into two sets of subclasses.
      For the MMAPLOCK, just use half the space for the one subclass to
      match the non-parent lock classes of the IOLOCK.
      For the ILOCK, use 0-4 as the ILOCK subclasses, 5-7 for the
      remaining individual subclasses.
      
      Because they are now all different, modify xfs_lock_inumorder() to
      handle the nested subclasses, and to assert fail if passed an
      invalid subclass. Further, annotate xfs_lock_inodes() to assert fail
      if an invalid combination of lock primitives and inode counts are
      passed that would result in a lockdep subclass annotation overflow.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      0952c818
    • D
      xfs: fix sb_meta_uuid usage · bbf155ad
      Dave Chinner 提交于
      After changing the UUID on a v5 filesystem, xfstests fails
      immediately on a debug kernel with:
      
      XFS: Assertion failed: uuid_equal(&ip->i_d.di_uuid, &mp->m_sb.sb_uuid), file: fs/xfs/xfs_inode.c, line: 799
      
      This needs to check against the sb_meta_uuid, not the user visible
      UUID that was changed.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      bbf155ad
    • B
      xfs: add missing bmap cancel calls in error paths · d4a97a04
      Brian Foster 提交于
      If a failure occurs after the bmap free list is populated and before
      xfs_bmap_finish() completes successfully (which returns a partial
      list on failure), the bmap free list must be cancelled. Otherwise,
      the extent items on the list are never freed and a memory leak
      occurs.
      
      Several random error paths throughout the code suffer this problem.
      Fix these up such that xfs_bmap_cancel() is always called on error.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      d4a97a04
  9. 29 7月, 2015 1 次提交
  10. 22 6月, 2015 1 次提交
  11. 04 6月, 2015 4 次提交
    • C
      xfs: saner xfs_trans_commit interface · 70393313
      Christoph Hellwig 提交于
      The flags argument to xfs_trans_commit is not useful for most callers, as
      a commit of a transaction without a permanent log reservation must pass
      0 here, and all callers for a transaction with a permanent log reservation
      except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES.  So remove
      the flags argument from the public xfs_trans_commit interfaces, and
      introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
      that regrants a log reservation instead of releasing it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      70393313
    • C
      xfs: remove the flags argument to xfs_trans_cancel · 4906e215
      Christoph Hellwig 提交于
      xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
      XFS_TRANS_ABORT.  Both of them are a direct product of the transaction
      state, and can be deducted:
      
       - any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
         and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
       - any transaction with a permanent log reservation needs
         XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
         XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
         log reservation is invalid.
      
      So just remove the flags argument and do the right thing.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      4906e215
    • C
      xfs: switch remaining xfs_trans_dup users to xfs_trans_roll · 2e6db6c4
      Christoph Hellwig 提交于
      We have three remaining callers of xfs_trans_dup:
      
       - xfs_itruncate_extents which open codes xfs_trans_roll
       - xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves
         attaching them to it's callers, but otherwise is identical to
         xfs_trans_roll
       - xfs_dir_ialloc looks at the log reservations in the old xfs_trans
         structure instead of the log reservation parameters, but otherwise
         is identical to xfs_trans_roll.
      
      By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch
      these three remaining users over to xfs_trans_roll and mark xfs_trans_dup
      static.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      2e6db6c4
    • B
      xfs: fix sparse inodes 32-bit compile failure · 3cdaa189
      Brian Foster 提交于
      The kbuild test robot reports the following compilation failure with a
      32-bit kernel configuration:
      
      	fs/built-in.o: In function `xfs_ifree_cluster':
      	>> xfs_inode.c:(.text+0x17ac84): undefined reference to `__umoddi3'
      
      This is due to the use of the modulus operator on a 64-bit variable in
      the ASSERT() added as part of the following commit:
      
      	xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()
      
      This ASSERT() simply checks that the offset of the inode in a sparse
      cluster is appropriately aligned. Since the maximum inode record offset
      is 63 (for a 64 inode record) and the calculated offset here should be
      something less than that, just use a 32-bit variable to store the offset
      and call the do_mod() helper.
      Reported-by: Nkbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      3cdaa189
  12. 29 5月, 2015 3 次提交
    • B
      xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() · 09b56604
      Brian Foster 提交于
      xfs_ifree_cluster() is called to mark all in-memory inodes and inode
      buffers as stale. This occurs after we've removed the inobt records and
      dropped any references of inobt data. xfs_ifree_cluster() uses the
      starting inode number to walk the namespace of inodes expected for a
      single chunk a cluster buffer at a time. The cluster buffer disk
      addresses are calculated by decoding the sequential inode numbers
      expected from the chunk.
      
      The problem with this approach is that if the inode chunk being removed
      is a sparse chunk, not all of the buffer addresses that are calculated
      as part of this sequence may be inode clusters. Attempting to acquire
      the buffer based on expected inode characterstics (i.e., cluster length)
      can lead to errors and is generally incorrect.
      
      We already use a couple variables to carry requisite state from
      xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a
      new internal structure to carry the existing parameters through these
      functions. Add an alloc field that represents the physical allocation
      bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster()
      to check each inode against the bitmap and skip the clusters that were
      never allocated as real inodes on disk.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      09b56604
    • B
      xfs: fix broken i_nlink accounting for whiteout tmpfile inode · 22419ac9
      Brian Foster 提交于
      XFS uses the internal tmpfile() infrastructure for the whiteout inode
      used for RENAME_WHITEOUT operations. For tmpfile inodes, XFS allocates
      the inode, drops di_nlink, adds the inode to the agi unlinked list,
      calls d_tmpfile() which correspondingly drops i_nlink of the vfs inode,
      and then finishes the common inode setup (e.g., clear I_NEW and unlock).
      
      The d_tmpfile() call was originally made inxfs_create_tmpfile(), but was
      pulled up out of that function as part of the following commit to
      resolve a deadlock issue:
      
      	330033d6 xfs: fix tmpfile/selinux deadlock and initialize security
      
      As a result, callers of xfs_create_tmpfile() are responsible for either
      calling d_tmpfile() or fixing up i_nlink appropriately. The whiteout
      tmpfile allocation helper does neither. As a result, the vfs ->i_nlink
      becomes inconsistent with the on-disk ->di_nlink once xfs_rename() links
      it back into the source dentry and calls xfs_bumplink().
      
      Update the assert in xfs_rename() to help detect this problem in the
      future and update xfs_rename_alloc_whiteout() to decrement the link
      count as part of the manual tmpfile inode setup.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      22419ac9
    • D
      xfs: xfs_attr_inactive leaves inconsistent attr fork state behind · 6dfe5a04
      Dave Chinner 提交于
      xfs_attr_inactive() is supposed to clean up the attribute fork when
      the inode is being freed. While it removes attribute fork extents,
      it completely ignores attributes in local format, which means that
      there can still be active attributes on the inode after
      xfs_attr_inactive() has run.
      
      This leads to problems with concurrent inode writeback - the in-core
      inode attribute fork is removed without locking on the assumption
      that nothing will be attempting to access the attribute fork after a
      call to xfs_attr_inactive() because it isn't supposed to exist on
      disk any more.
      
      To fix this, make xfs_attr_inactive() completely remove all traces
      of the attribute fork from the inode, regardless of it's state.
      Further, also remove the in-core attribute fork structure safely so
      that there is nothing further that needs to be done by callers to
      clean up the attribute fork. This means we can remove the in-core
      and on-disk attribute forks atomically.
      
      Also, on error simply remove the in-memory attribute fork. There's
      nothing that can be done with it once we have failed to remove the
      on-disk attribute fork, so we may as well just blow it away here
      anyway.
      
      cc: <stable@vger.kernel.org> # 3.12 to 4.0
      Reported-by: NWaiman Long <waiman.long@hp.com>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      6dfe5a04
  13. 25 3月, 2015 6 次提交
    • F
      xfs: use bool instead of int in xfs_rename() · 86aaf02e
      Fabian Frederick 提交于
      new_parent and src_is_directory are only used in 0/1 context.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      86aaf02e
    • D
      xfs: add RENAME_WHITEOUT support · 7dcf5c3e
      Dave Chinner 提交于
      Whiteouts are used by overlayfs -  it has a crazy convention that a
      whiteout is a character device inode with a major:minor of 0:0.
      Because it's not documented anywhere, here's an example of what
      RENAME_WHITEOUT does on ext4:
      
      # echo foo > /mnt/scratch/foo
      # echo bar > /mnt/scratch/bar
      # ls -l /mnt/scratch
      total 24
      -rw-r--r-- 1 root root     4 Feb 11 20:22 bar
      -rw-r--r-- 1 root root     4 Feb 11 20:22 foo
      drwx------ 2 root root 16384 Feb 11 20:18 lost+found
      # src/renameat2 -w /mnt/scratch/foo /mnt/scratch/bar
      # ls -l /mnt/scratch
      total 20
      -rw-r--r-- 1 root root     4 Feb 11 20:22 bar
      c--------- 1 root root  0, 0 Feb 11 20:23 foo
      drwx------ 2 root root 16384 Feb 11 20:18 lost+found
      # cat /mnt/scratch/bar
      foo
      #
      
      In XFS rename terms, the operation that has been done is that source
      (foo) has been moved to the target (bar), which is like a nomal
      rename operation, but rather than the source being removed, it have
      been replaced with a whiteout.
      
      We can't allocate whiteout inodes within the rename transaction due
      to allocation being a multi-commit transaction: rename needs to
      be a single, atomic commit. Hence we have several options here, form
      most efficient to least efficient:
      
          - use DT_WHT in the target dirent and do no whiteout inode
            allocation.  The main issue with this approach is that we need
            hooks in lookup to create a virtual chardev inode to present
            to userspace and in places where we might need to modify the
            dirent e.g. unlink.  Overlayfs also needs to be taught about
            DT_WHT. Most invasive change, lowest overhead.
      
          - create a special whiteout inode in the root directory (e.g. a
            ".wino" dirent) and then hardlink every new whiteout to it.
            This means we only need to create a single whiteout inode, and
            rename simply creates a hardlink to it. We can use DT_WHT for
            these, though using DT_CHR means we won't have to modify
            overlayfs, nor anything in userspace. Downside is we have to
            look up the whiteout inode on every operation and create it if
            it doesn't exist.
      
          - copy ext4: create a special whiteout chardev inode for every
            whiteout.  This is more complex than the above options because
            of the lack of atomicity between inode creation and the rename
            operation, requiring us to create a tmpfile inode and then
            linking it into the directory structure during the rename. At
            least with a tmpfile inode crashes between the create and
            rename doesn't leave unreferenced inodes or directory
            pollution around.
      
      By far the simplest thing to do in the short term is to copy ext4.
      While it is the most inefficient way of supporting whiteouts, but as
      an initial implementation we can simply reuse existing functions and
      add a small amount of extra code the the rename operation.
      
      When we get full whiteout support in the VFS (via the dentry cache)
      we can then look to supporting DT_WHT method outlined as the first
      method of supporting whiteouts. But until then, we'll stick with
      what overlayfs expects us to be: dumb and stupid.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      7dcf5c3e
    • D
      xfs: make xfs_cross_rename() complete fully · eeacd321
      Dave Chinner 提交于
      Now that xfs_finish_rename() exists, there is no reason for
      xfs_cross_rename() to return to xfs_rename() to finish off the
      rename transaction. Drive the completion code into
      xfs_cross_rename() and handle all errors there so as to simplify
      the xfs_rename() code.
      
      Further, push the rename exchange target_ip check to early in the
      rename code so as to make the error handling easy and obviously
      correct.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      eeacd321
    • D
      xfs: factor out xfs_finish_rename() · 310606b0
      Dave Chinner 提交于
      Rather than use a jump label for the final transaction commit in
      the rename, factor it into a simple helper function and call it
      appropriately. This slightly reduces the spaghetti nature of
      xfs_rename.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      310606b0
    • D
      xfs: cleanup xfs_rename error handling · 445883e8
      Dave Chinner 提交于
      The jump labels are ambiguous and unclear and some of the error
      paths are used inconsistently. Rules for error jumps are:
      
      - use out_trans_cancel for unmodified transaction context
      - use out_bmap_cancel on ENOSPC errors
      - use out_trans_abort when transaction is likely to be dirty.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      445883e8
    • D
      xfs: clean up inode locking for RENAME_WHITEOUT · 95afcf5c
      Dave Chinner 提交于
      When doing RENAME_WHITEOUT, we now have to lock 5 inodes into the
      rename transaction. This means we need to update
      xfs_sort_for_rename() and xfs_lock_inodes() to handle up to 5
      inodes. Because of the vagaries of rename, this means we could have
      anywhere between 3 and 5 inodes locked into the transaction....
      
      While xfs_lock_inodes() does not need anything other than an assert
      telling us we are passing more inodes that we ever thought we should
      see, it could do with a logic rework to remove all the indenting.
      This is not a functional change - it just makes the code a lot
      easier to read.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      
      95afcf5c
  14. 24 2月, 2015 1 次提交
  15. 23 2月, 2015 2 次提交
    • D
      xfs: inodes are new until the dentry cache is set up · 58c90473
      Dave Chinner 提交于
      Al Viro noticed a generic set of issues to do with filehandle lookup
      racing with dentry cache setup. They involve a filehandle lookup
      occurring while an inode is being created and the filehandle lookup
      racing with the dentry creation for the real file. This can lead to
      multiple dentries for the one path being instantiated. There are a
      host of other issues around this same set of paths.
      
      The underlying cause is that file handle lookup only waits on inode
      cache instantiation rather than full dentry cache instantiation. XFS
      is mostly immune to the problems discovered due to it's own internal
      inode cache, but there are a couple of corner cases where races can
      happen.
      
      We currently clear the XFS_INEW flag when the inode is fully set up
      after insertion into the cache. Newly allocated inodes are inserted
      locked and so aren't usable until the allocation transaction
      commits. This, however, occurs before the dentry and security
      information is fully initialised and hence the inode is unlocked and
      available for lookups to find too early.
      
      To solve the problem, only clear the XFS_INEW flag for newly created
      inodes once the dentry is fully instantiated. This means lookups
      will retry until the XFS_INEW flag is removed from the inode and
      hence avoids the race conditions in questions.
      
      THis also means that xfs_create(), xfs_create_tmpfile() and
      xfs_symlink() need to finish the setup of the inode in their error
      paths if we had allocated the inode but failed later in the creation
      process. xfs_symlink(), in particular, needed a lot of help to make
      it's error handling match that of xfs_create().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      58c90473
    • D
      xfs: introduce mmap/truncate lock · 653c60b6
      Dave Chinner 提交于
      Right now we cannot serialise mmap against truncate or hole punch
      sanely. ->page_mkwrite is not able to take locks that the read IO
      path normally takes (i.e. the inode iolock) because that could
      result in lock inversions (read - iolock - page fault - page_mkwrite
      - iolock) and so we cannot use an IO path lock to serialise page
      write faults against truncate operations.
      
      Instead, introduce a new lock that is used *only* in the
      ->page_mkwrite path that is the equivalent of the iolock. The lock
      ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
      and so in truncate we can i_iolock -> i_mmaplock and so lock out
      new write faults during the process of truncation.
      
      Because i_mmap_lock is outside the page lock, we can hold it across
      all the same operations we hold the i_iolock for. The only
      difference is that we never hold the i_mmaplock in the normal IO
      path and so do not ever have the possibility that we can page fault
      inside it. Hence there are no recursion issues on the i_mmap_lock
      and so we can use it to serialise page fault IO against inode
      modification operations that affect the IO path.
      
      This patch introduces the i_mmaplock infrastructure, lockdep
      annotations and initialisation/destruction code. Use of the new lock
      will be in subsequent patches.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      653c60b6
  16. 22 1月, 2015 1 次提交
  17. 24 12月, 2014 1 次提交
  18. 04 12月, 2014 1 次提交
  19. 28 11月, 2014 1 次提交